ABSTRACT

This  paper  is  concerned  with  executing  programs  on  a  number  of 
processors.  Various  machine  configurations  are  discussed.  Algorithms 
for  transforming  programs  are  given  and  the  results  of  transforming  a 
number  of  real  programs  are  presented.  The  relationships  between 
machine  organization  and  program  organization  are  emphasized. 
1. Introduction

In  the  1965-75  decade  we  have  seen  a  number  of  changes  in 
computer  technology  and  computer  use.  Integrated  circuits  have  arrived  and 
with  them  have  come  large,  fast  semiconductor  memories;  microprocessors 
which  can  be  used  as  components;  and  the  potential  for  a  variety  of  new 
system  architectures.  Users  of  computers  in  this  period  have  become  quite 
concerned  about  the  reliability  of  their  hardware  and  software.  They  have 
also  come  to  expect  computer  services  to  fit  their  needs,  whether  this  be 
through  a  personal  minicomputer,  a  supercomputer  at  the  other  end  of  a 
network,  or  a  special  purpose  computer  in  their  computer  center. 

In  the  midsixties,  there  were  many  debates  about  which  direction 
computer  organization  would  go.  Stacks  vs.  registers,  binary  vs.  hexadecimal, 
time  sharing  vs.  batch  processing  vs.  remote  batch  were  all  being  discussed. 
Whether  fast  computers  should  be  organized  as  multiprocessors,  array 
processors,  pipeline  processors,  or  associative  processors,  was  widely 
discussed.  The  discussions  were  often  mainly  emotional,  with  no  substantive 
arguments  to  back  them  up.  Not  surprisingly,  everybody  won  in  many  of  these 
contests.  Indeed,  computers  are  wonderfully  flexible  devices  and  can  be 
twisted  into  many  forms  with  great  success. 

This  is  not  to  say  that  every   new  computer  organization  is  a  good 
idea  and  will  survive.  In  fact,  in  this  decade  the  entire  computer  divisions 
of  several  major  companies  have  failed.  Nor  is  it  to  say  that  we  lack  ideas 
of  universal  applicability.  As  examples,  hierarchies  of  virtual  memory  and 
microprogrammed  control  units  have  at  last  been  adopted,  if  not  discovered, 
by  just  about  everybody. 

In  the  midseventies,  one  can  still  hear  a  number  of  debates.  Some 
of  them  have  not  changed  from  the  midsixties.  There  are  also  new  ones  about 
how to make reliable hardware and software, how to bring computer services
to ordinary people, and how to exploit ever higher levels of integrated
circuit technology. The latter subject obviously provides one of the
cornerstones of the whole subject of computer design and use.

While  circuit  speeds  have  improved  in  the  past  decade,  their  costs 
have  improved  even  more.  Thus  designers  can  afford  to  use  more  and  more  gates 
in  computer  systems.  But  some  of  the  traditional  design  considerations  have 
changed.  At  the  present  time,  printed  circuit  board  and  wire  delays  often 
dominate  gate  delays  in  system  design.  Thus  computer  organization  itself  would 
now  seem  to  be  a  more  important  question  than  circuit  type  and  gate  count 
minimization  at  the  hardware  level  of  computer  design. 

It  is  also  clear  that  by  properly  organizing  a  machine,  various 
software  features  can  be  more  easily  supported.  Stacks,  extra  bits,  special 
registers  and  instructions  are  examples  of  this.  And  the  availability  of 
low  cost  integrated  circuits  makes  all  of  these  feasible,  even  in  low  cost 
machines. 

Besides  these  hardware  and  software  considerations,  a  computer 
designer  must  worry  about  what  applications  are  to  be  served  by  his  machine. 
In  the  days  of  truly  general  purpose  computers,  "Any  color  you  want  as  long 
as it's black," was sufficient. But general purpose machines in this sense
have disappeared. Even the IBM 360 series provided specialization in terms
of size and speed. And the list of specialization improvements over that
series is very long. It ranges from minis to supers, from scientific to
banking  and  from  real  time  to  background  specializations. 

Some  people  would  argue  that  software  design  is  a  much  more 
important  question  than  computer  system  design.  As  support  they  would  offer 
the skyrocketing costs of software and the sharply dropping hardware
costs. However, these observations probably support the converse position
even  more.  In  view  of  decreasing  hardware  costs,  how  can  we  better  organize 
machines  so  that  software  and  applications  are  better  and  less  expensively 
handled?  Indeed,  software  is  not  an  end  in  itself. 

One  of  the  main  points  of  confusion  and  disagreement  over  the 
design  of  better  hardware  and  software  has  always  been  in  deciding  on  goals. 
People  of  one  background  or  another  tend  to  have  biases  of  one  kind  or 
another.  And  an  entire  computer  and  software  system  has  so  much  complexity 
that  it  is  difficult  for  one  person  to  think  about  all  the  interrelating 
details.  Even  if  a  "perfect"  system  could  be  conceptualized  by  one  person, 
he  could  not  build  it  by  himself.  Many  people  must  be  involved.  Indeed 
this  has  been  the  downfall  of  many  systems,  the  first  such  having  been 
Charles  Babbage's  Analytical  Engine. 

The  above  remarks  are  well  understood  by  any  serious  hardware  or 
software  designer.  A  point  that  is  not  so  well  understood,  or  at  least  it 
is  widely  ignored,  is  that  a  good  system's  main  goal  is  to  serve  its  end 
users  well.  Many  computers  and  many  software  systems  are  designed  from 
beginning  to  end  with  only  a  nod  to  the  ultimate  users,  the  main  design 
goals  being  whatever  "improvements"  the  designers  can  make  over  their 
previous  designs.  Such  "improvements"  are  often  intrinsic  to  the  hardware 
or  software  and  may  or  may  not  be  reflected  in  what  the  end  user  sees. 

The standard way to design any big system, whether hardware, software,
highway or bridge, is to break it up into hierarchies and subparts. When
parts  which  are  analytically  tractable  are  found,  the  proper  analysis  provides 
a  solution  for  that  part.  Other  parts  are  solved  using  the  best  intuition 
the  designer  has. 


A  key  question  in  improving  the  system  design  procedure  would 
seem  to  be  the  following.  How  can  we  integrate  users'  problems  into  the 
design  procedure?  The  answer  to  this  is  not  obvious.  Usually,  users  do 
not  know  exactly  what  they  want.  They  often  know  what  was  wrong  with  their 
previous  system.  But  solving  these  problems  is  often  similar  to  making 
improvements  of  an  intrinsic  hardware  or  software  nature  as  mentioned 
above.  They  do  not  lead  to  a  global,  system  improvement. 

A  partial  answer  to  what  the  user  wants  may  be  found  in  looking 
at  the  programs  he  runs.  While  these  may  not  reflect  exactly  the  algorithms 
he  wishes  to  run,  they  at  least  provide  an  objective  measure  of  what  he  is 
doing.  If  a  system  could  handle  these  programs  well,  then  at  least  for  a 
short  while,  the  user  would  be  happy. 

In  this  paper  we  will  consider  several  aspects  of  the  problem  of 
integrating  the  analysis  of  users'  programs  with  the  design  of  computer 
systems.  To  tackle  the  problem  in  a  concrete  way,  it  is  reasonable  to 
restrict  our  initial  study.  We  will  first  deal  with  Fortran  programs 
because  there  are  many  of  them  in  existence  and  because  the  language  is 
quite  simple.  Our  primary  design  goal  will  be  the  fast  execution  of 
programs.  This  goal  is  indeed  probably  the  primary  objective  of  "users" 
ranging  from  computer  center  managers  who  want  high  system  throughput  to 
individual  programmers  who  want  fast  turnaround.  Of  course,  reliability, 
ease  of  use,  quality  of  results  and  so  on  are  also  important,  but  we  will 
deal  with  one  problem  at  a  time.  The  ideas  we  will  discuss  are  applicable 
to  programs  in  languages  other  than  Fortran--we  use  it  because  it  is  popular 
and  of  relatively  low  level--as  we  shall  mention  in  section  5. 


Speed  Limits 

What  are  the  factors  that  limit  the  speed  of  machine  computation? 
Or,  to  sharpen  the  question  a  bit:  Given  a  set  of  programs,  what  determines 
how  fast  they  can  be  executed?  Basically,  there  are  two  aspects  to  the 
answer.  One  aspect  concerns  the  physics  of  the  hardware  being  used.  The 
other  aspect  concerns  the  logic  of  the  machine  organization  and  the  program 
organization. 

The  physical  limits  on  computer  speed  are  rather  widely  understood. 
Gates  can  switch  in  some  fixed  time,  and  we  must  pay  for  the  sum  of  a  number 
of  gate  delays  to  perform  various  computer  operations.  Additionally,  signals 
must  propagate  along  printed  circuit  boards  and  on  wires  between  boards,  and 
these  delays  are  often  larger  than  gate  delays.  We  shall  not  concern  ourselves 
with  these  questions.  Rather,  we  shall  assume  that  circuit  speeds  are  fixed 
and  consider  the  logical  problems  involved. 

The  logic  of  machine  organization  has  been  studied  for  many  years. 
Greater  machine  speed  through  simultaneity  or  parallelism  has  been  a  subject 
of  much  study.  Parallelism  has  been  used  at  many  levels,  between  bits, 
between  words,  between  functions,  and  so  on.  Shortly  we  shall  give  more 
details  of  this. 

The  relations  between  the  organization  of  a  machine  and  the 
organization  of  a  program  to  be  executed  on  that  machine  have  not  been 
studied  much.  Of  course,  the  compilation  of  a  Fortran  program  for  execution 
on  a  serial  computer  is  a  special  case  of  this.  But  compiler  theory  has 
mainly  been  developed  with  respect  to  languages.  The  semantic  or  machine 
related  aspects  of  compilation  are  usually  handled  in  ad  hoc  ways. 

Beyond this, we are really concerned with the syntactic transformation
of algorithms given in the form of programs into forms which exhibit high
amounts of parallelism. At the same time, we are interested in clarifying
what kinds of machine organizations correspond to the parallel forms of
programs. Thus we seek program transformations and machine organizations
which together allow for high speed and efficient execution of any given
serial program.

A  properly  developed  theory  will  have  a  number  of  benefits.  For 
one  thing,  it  will  allow  us  to  see  what  the  logical  limits  of  computation 
speed  are  (in  contrast  to  the  physical  limits),  and  to  see  how  close  to 
them  we  are  operating.  It  will  also  give  us  constructive  procedures  for 
designing  machines  and  compilers  for  those  machines.  Another  benefit,  which 
we  discuss  below,  is  that  we  can  obtain  a  unified  approach  to  logic  design 
and  compiler  design,  since  abstractly,  many  of  the  problems  are  identical. 


Our  analysis  of  Fortran-like  programs  can  be  carried  out  at  several 
levels.  First,  we  can  consider  the  elementary  statements  in  the  language, 
e.g.,  assignment,  IF,  DO,  etc.  Then  we  can  consider  whole  programs  and 
see  how  these  statements  fit  together.  This  can  be  done  at  an  abstract 
level  and  also  by  studying  real  programs  using  the  abstract  theory.  The 
paper  also  contains  several  discussions  of  algorithms  in  other  programming 
languages,  but  we  will  not  develop  these  points  at  much  length  here. 

With  a  good  understanding  of  the  structure  of  programs  behind  us, 
it  is  proper  to  consider  machine  organizations.  In  this  paper  we  will  mainly 
discuss  processor,  switch  and  primary  memory  design.  Control  units  and 
memory  hierarchies  are  being  studied  in  a  similar  way  but  these  areas  are 
not  as  well  developed  at  this  point. 

Our  long  term  objective  is  to  develop  methods  for  the  rational 
design  of  computer  systems  which  are  well  matched  to  the  classes  of  programs 
they  are  to  execute.  By  developing  our  ideas  theoretically,  we  can  see 
what  our  ultimate  objectives  in  terms  of  bounds  might  be.  We  can  also 
observe  that  several  dissimilar  aspects  of  computer  system  design  consist  of 
ideas  which  are  identical  at  a  theoretical  level.  Thus  a  coherent  body  of 
theoretical  material  can  be  used  at  the  logic  design  level  and  also  at  the 
compiler  design  level. 


Logic  Design  and  Compiler  Uniformity 

To  give  an  intuitive  overview  of  our  ideas,  let  us  begin  with  a 
few  simple  examples.  The  basic  question  is,  how  fast  can  we  carry  out 
certain  functions. 

First,  consider  the  problem  of  performing  logical  operations  on 
two n-bit computer words $a = (a_1 \ldots a_n)$ and $b = (b_1 \ldots b_n)$. If we have

a = (101101)
and b = (011001)
then the result of a logical OR defined as $(a_i + b_i)$ is

c = (111101)
and the result of a logical AND defined as $(a_i \cdot b_i)$ is

d = (001001).
Note  that  either  the  AND  or  the  OR  function  is  performed  on  pairs  of  bits  of 
a  and  b,  independently  of  all  other  bits  in  the  words.  Hence  it  is  obvious 
that  either  of  these  functions  can  be  computed  in  a  time  (say,  one  gate  delay) 
which  is  independent  of  the  number  of  bits  involved.  This  assumes  that  we 
can  use  as  many  gates  as  we  need;  in  this  case  the  number  of  AND  and  OR  gates 
will  be  proportional  to  n. 

Now  let  us  turn  our  attention  to  arithmetic  operations  rather 
than logical operations. Again consider $a = (a_1, \ldots, a_n)$ and
$b = (b_1, \ldots, b_n)$, but now let the $a_i$ and $b_i$ be full computer words each
representing a number. If we have

a = (3,5,2,1,0,7)
and b = (1,2,3,4,5,6),
then the result of a vector add is

c = (4,7,5,5,5,13)
and the result of a vector multiply is

d = (3,10,6,4,0,42).
Just  as  in  the  logical  case,  we  can  perform  all  of  the  above  arithmetic 
operations  independently  of  one  another.  Thus,  regardless  of  n,  the 
dimension  of  the  vectors,  we  can  form  c  in  one  add  time  or  d  in  one  multiply 
time.  This  assumes  that  we  have  n  adders  or  n  multipliers  available. 
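
The independence of the element-wise operations can be made concrete with a short sketch. The helper name `elementwise` is ours and is not part of the original text; the loop runs serially in Python, but no iteration reads a value produced by another, which is the property that lets n gates, adders or multipliers work at once.

```python
import operator


def elementwise(op, a, b):
    """Apply op to corresponding elements; each position is independent."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return [op(x, y) for x, y in zip(a, b)]


# Bit-level example from the text (words written as lists of bits).
a_bits = [1, 0, 1, 1, 0, 1]
b_bits = [0, 1, 1, 0, 0, 1]
print(elementwise(operator.or_, a_bits, b_bits))   # [1, 1, 1, 1, 0, 1]
print(elementwise(operator.and_, a_bits, b_bits))  # [0, 0, 1, 0, 0, 1]

# Word-level example from the text (vectors of numbers).
a_vec = [3, 5, 2, 1, 0, 7]
b_vec = [1, 2, 3, 4, 5, 6]
print(elementwise(operator.add, a_vec, b_vec))  # [4, 7, 5, 5, 5, 13]
print(elementwise(operator.mul, a_vec, b_vec))  # [3, 10, 6, 4, 0, 42]
```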

Next,  let  us  consider  some  more  difficult  problems  at  the  bit  and 
arithmetic level.

Suppose we have one computer word $a = (a_1 \ldots a_n)$ in which bit $a_i$
corresponds to the occurrence of some event. In other words, let $a_i = 1$ if
event $e_i$ has occurred and $a_i = 0$ otherwise. Further, let b and c be one-bit
indicators defined as follows. If any of events $e_1 \ldots e_n$ have occurred,
we want b to be 1, otherwise b = 0. And if all of events $e_1 \ldots e_n$ have
occurred, we want c to be 1, otherwise c = 0.

What  are  the  fastest  possible  designs  for  logical  circuits  which 
compute  b  and  c?  It  is  intuitively  clear  that  these  problems  are  more  difficult 
than  those  we  discussed  above,  namely,  the  pairwise  logical  AND  and  OR 
problems.  Here,  all  bits  in  the  word  a  are  involved  in  the  computation  of 
b  and  c.  A  simple  way  to  solve  this,  which  also  turns  out  to  be  the  fastest, 
is the following. To form b, we compute $a_1 + a_2$, $a_3 + a_4$, ..., $a_{n-1} + a_n$
simultaneously using n/2 OR gates (assuming $n = 2^k$). Then we compute
$(a_1 + a_2) + (a_3 + a_4)$ and so on, fanning in the result to a single result
bit  b,  as  shown  in  Figure  1.  If  we  replace  the  logical  OR  by  a  logical  AND 
in  the  above  discussion  we  form  the  result  c. 

It is not difficult to prove that for such problems, this kind of
fan-in approach yields the best possible result. The technique is useful in
many  logic  design  problems. 

Now  we  consider  the  arithmetic  problems  corresponding  to  the  above 
logic  design  questions.  If 

$a = (a_1, \ldots, a_n)$

is a vector of n numbers stored in a computer, suppose we want to compute
the sum and product

$b = \sum_{i=1}^{n} a_i$ and $c = \prod_{i=1}^{n} a_i$.

Instead  of  dealing  with  gates,  we  must  now  consider  adders  or  multipliers 
as  our  basic  building  blocks.  Again,  the  best  solutions  to  these  problems 
are  obtained  by  simply  fanning  in  the  arguments.  The  tree  of  Figure  1 
illustrates  this.  If  we  now  interpret  the  +  as  an  arithmetic  addition,  the 
result  is  b,  and  similarly  for  c. 

We  see  from  the  above  discussion  that  for  a  class  of  computations 
which  require  the  interaction  of  more  than  two  data  elements,  more  time  is 
required than was needed by our first type of computation. In particular,
for these calculations, if n arguments are involved then we need $\lceil \log_2 n \rceil$
operation  times  to  compute  the  result.  An  operation  time  may  be  a  logical 
OR  or  AND,  or  it  may  be  an  arithmetic  add  or  multiply. 
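
A minimal sketch of the fan-in scheme follows, under the assumption that the operation is associative and the input is non-empty. The function name `fan_in` is ours. Each pass of the `while` loop corresponds to one level of the tree, and all pairs within a level are independent, so the number of levels is the number of operation times.

```python
import math
import operator


def fan_in(values, op):
    """Combine values pairwise, level by level; return (result, levels)."""
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Every pair below is independent and could be combined at once.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired element passes through
        level = nxt
        levels += 1
    return level[0], levels


bits = [1, 0, 1, 1, 0, 1]
print(fan_in(bits, operator.or_))    # (1, 3): some event has occurred
print(fan_in(bits, operator.and_))   # (0, 3): not all events have occurred

nums = [3, 5, 2, 1, 0, 7]
print(fan_in(nums, operator.add))    # (18, 3)
print(math.ceil(math.log2(len(nums))))  # 3
```

The same routine serves the logic design case (OR, AND) and the arithmetic case (add, multiply); only the meaning of one "operation time" changes.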

Finally,  let  us  consider  an  even  messier  kind  of  computation. 
Suppose we have two n-bit words

$a = (a_1 \ldots a_n)$

and $b = (b_1 \ldots b_n)$

and we want to compute

$$c_i = \begin{cases} 0 & \text{if } i = 0 \\ a_i + b_i c_{i-1} & \text{if } 1 \le i \le n \end{cases}$$

This may seem to be a strange logic design problem. It is not very unusual,
however,  since  it  forms  the  heart  of  the  design  of  a  binary  adder.  In 
particular,  this  recurrence  relation  accounts  for  carry  generation  which  is 
the  main  time  delay  in  an  adder  circuit.  How  much  time  is  required  to  compute 
the  vector  of  carry  bits  c? 

The  solution  of  this  problem  is  not  as  obvious  as  were  the  solutions 
of  our  earlier  problems.  Before  discussing  how  to  solve  it,  let  us  consider 
an  analogous  arithmetic  problem.  It  frequently  occurs  in  numerical  computation 
that we wish to evaluate a polynomial

$P(x) = a_n + a_{n-1} x + a_{n-2} x^2 + \cdots + a_0 x^n$.

Traditionally, we are told to do this using Horner's rule

$h(x) = a_n + x(a_{n-1} + x(a_{n-2} + \cdots + x(a_1 + x a_0) \cdots))$       (1)

since it requires only $O(n)$ operations. This can be restated as a program of
the form

P = A(0)
FOR I = 1 TO N
    P = A(I) + X * P
which  computes  the  expression  h(x)  from  inside  out  in  N  iterations. 
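
For readers who wish to run the loop, here is a direct transcription into Python. The function name `horner` and the list `coeffs` are our own labels; `coeffs[i]` plays the role of A(I), so the first entry multiplies the highest power of x.

```python
def horner(coeffs, x):
    """Evaluate a_n + a_(n-1) x + ... + a_0 x^n, with coeffs = [a_0, ..., a_n]."""
    p = coeffs[0]
    for a in coeffs[1:]:
        # Each step needs the previous value of p: a serial chain of N steps.
        p = a + x * p
    return p


# a_0 = 2, a_1 = 3, a_2 = 4 at x = 5: 4 + 3*5 + 2*25
print(horner([2, 3, 4], 5))  # 69
```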

It  is  clear  that  both  problems  share  an  important  property. 
Both are recurrences in which step i depends on the result which was computed
during step i-1. This may initially give us the sinking feeling that no
speedup is possible here. To show that $O(n)$ steps are not required to
compute such linear recurrence functions is in general a nontrivial problem
which  has  been  studied  in  many  forms.  We  will  give  more  attention  to  this 
at  the  program  level  later.  At  the  logic  design  level,  it  is  discussed  in 
Chen  and  Kuck  [21],  where  algorithms  are  given  for  transforming  any  linear 
sequential  circuit  specification  into  a  fast  combinational  circuit.  Time 
and  component  bounds  are  given  for  such  circuits  as  adders,  multipliers 
and  one's  position  counters,  which  compare  favorably  with  those  derived  by 
traditional  methods. 
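
To suggest why fewer than n steps can suffice, we sketch one standard technique, recursive doubling. This is an illustration under our own assumptions and is not claimed to be the construction of Chen and Kuck [21]. Each step of a recurrence of the form c_i = a_i + b_i c_(i-1) is treated as a map c -> a_i + b_i c; composing two such maps gives another map of the same form, so neighbouring maps can be combined in rounds. The name `solve_recurrence` and its parameters are hypothetical, and the method assumes that `mul` distributes over `add`, which holds both for ordinary arithmetic and for AND over OR.

```python
import operator


def solve_recurrence(a, b, c0, add=operator.add, mul=operator.mul):
    """Return ([c_1, ..., c_n], rounds) for c_i = a_i + b_i * c_(i-1).

    pairs[i] = (A, B) stands for the map c -> A + B * c. After the loop,
    pairs[i] is the composition of the first i + 1 maps.
    """
    pairs = list(zip(a, b))
    n = len(pairs)
    shift, rounds = 1, 0
    while shift < n:
        nxt = list(pairs)
        # All compositions in one round are independent of each other.
        for i in range(shift, n):
            a_hi, b_hi = pairs[i]          # later group of steps
            a_lo, b_lo = pairs[i - shift]  # earlier group of steps
            # hi(lo(c)) = a_hi + b_hi * (a_lo + b_lo * c)
            nxt[i] = (add(a_hi, mul(b_hi, a_lo)), mul(b_hi, b_lo))
        pairs = nxt
        shift *= 2
        rounds += 1
    return [add(ai, mul(bi, c0)) for ai, bi in pairs], rounds


# Carry-style recurrence on bits, with OR as "+" and AND as "*".
a_bits = [1, 0, 1, 1, 0, 1]
b_bits = [0, 1, 1, 0, 0, 1]
print(solve_recurrence(a_bits, b_bits, 0, operator.or_, operator.and_))
# ([1, 1, 1, 1, 0, 1], 3)

# Horner's rule for a_0 = 2, a_1 = 3, a_2 = 4 at x = 5, started from P = a_0.
print(solve_recurrence([3, 4], [5, 5], 2))  # ([13, 69], 1)
```

The number of rounds grows as the ceiling of log2 n, at the price of performing more operations in total than the serial loop.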
We can derive two important theoretical problems from the above.
One  is  tree-height  reduction  for  arithmetic  expression  parse  trees  or  for 
combinational  logic  expressions.  The  other  is  the  fast  solution  of  linear 
recurrences  derived  either  from  programs  or  from  logic  design  problems.  If 
we  have  fast,  efficient  procedures  for  solving  both  types  of  problems  in  a 
parallel  way,  we  will  have  a  good  understanding  of  important  theoretical 
aspects  of  compiler  writing  and  logic  design  automation,  respectively. 

In  section  2,  we  will  discuss  some  details  of  these  problems.  It 
can,  in  fact,  be  shown  that  any  arithmetic  expression  or  simple  linear  recurrence 
containing $O(n)$ arguments can be solved in $O(\log n)$ time steps. The width
of the tree (number of processors) needed is just $O(n)$ in either case. Thus
for any of these calculations, which require $O(n)$ time steps if performed
serially, we can speed them up by a factor of $O(n/\log n)$, using just $O(n)$
processors. As we saw above, some computations can be speeded up even more
(e.g., by $O(n)$). And as we shall see in section 2, the best known speedups
for  some  computations  are  much  less  than  this. 

Using  this  theoretical  background,  we  will  turn  our  attention  to 
the  analysis  of  whole  programs  in  section  3.  We  will  consider  transformations 
of  blocks  of  assignment  statements,  loops,  conditional  statements  and  program 
graphs.  Algorithms  for  such  transformations,  as  well  as  resulting  time  and 
processor  bounds,  will  be  discussed.  Such  algorithms  can  serve  as  the  basis 
for  program  measurement  to  aid  in  the  design  of  effective  machines.  They  can 
also  be  used  as  a  model  of  a  compiler  for  parallel  or  pipeline  computers. 
Above  the  bit-level  of  logic  design,  these  are  our  primary  motivations,  but 
we  can  also  interpret  our  work  in  several  other  ways.  Since  we  are  really 
engaged in a study of the structure of programs, our results seem useful in
the contexts of structured programming (e.g., since we remove GOTOs) and
also  memory  hierarchy  management  (e.g.,  since  we  can  reduce  program  page 
space-time  products). 

In  section  4,  we  discuss  some  aspects  of  real  computers,  including 
the  accessing,  aligning,  and  processing  of  data.  We  also  sketch  some 
results  from  our  analysis  of  a  number  of  real  Fortran  programs.  To 
relate  parallel  machine  organizations  to  algorithm  organizations,  we  give 
a  cost/effectiveness  measure  and  a  number  of  examples  of  its  use. 
2. Theoretical Fundamentals

The two basic building blocks of any numerical program are
arithmetic  expressions  and  linear  recurrences.  In  this  section  we  will 
give  upper  bounds  on  the  time  and  number  of  processors  needed  for  the  fast 
parallel  evaluation  of  both  of  these.  In  order  to  keep  our  discussion 
simple,  we  will  (in  this  section)  ignore  memory  and  data  alignment  times 
as  well  as  control  unit  activity.  Although  we  will  bound  the  number,  we 
assume  as  many  processors  as  needed  are  available. 

We  will  assume  that  each  arithmetic  operation  takes  one  unit  of 
time.  Our  recurrence  methods  allow  all  processors  to  perform  the  same 
operation  at  the  same  time.  This  SIMD  (single  instruction,  multiple  data 
[31]) operation is the simplest for a parallel or pipeline machine organization
to perform. Our arithmetic expression bounds assume that different
processors  can  perform  different  operations  at  the  same  time.  This  MIMD 
(multiple  instruction,  multiple  data  [31])  behavior  assumes  a  more  complex 
control  unit.  However,  it  is  obvious  that  the  bounds  need  be  adjusted  by 
only  a  small  constant  to  allow  them  to  be  used  for  SIMD  machines.  In  the 
worst  case  we  can  assume  a  machine  which  simply  cycles  through  each  of  the 
four  arithmetic  operations  on  each  "macro-step",  although  more  delicate 
schemes  are  easy  to  devise.  In  any  case,  most  of  our  speedup  in  most 
programs  comes  from  the  speedup  of  linear  recurrences.  Subscripted 
arithmetic  expressions  (which  are  not  recurrences)  inside  loops  can  simply 
be  handled  as  trees  of  arrays,  so  SIMD  operation  holds. 

We  will  give  a  number  of  results  about  tree-height  reduction  first. 
This  theory  has  been  well  developed  and  we  will  give  more  results  than  are 
justified  for  practical  compilation.  But  we  will  indulge  ourselves  a  bit, 
since  the  material  is  interesting  in  an  abstract  sense,  at  least. 
If $T_p$ is the number of unit time steps required to perform some
calculation using $p \ge 1$ processors, we define the speedup of the p processor
calculation over a uniprocessor as

$S_p = T_1 / T_p \ge 1$

and we define the efficiency of the calculation as

$E_p = S_p / p \le 1$,

which  may  be  regarded  as  actual  speedup  divided  by  the  maximum  possible 
speedup  using  p  processors.  For  various  computations  we  will  discuss  the 
maximum  possible  speedup  known  according  to  some  algorithm  and  in  such  cases 
we  use  P  to  denote  the  minimum  number  of  processors  known  to  achieve  this 
maximum speedup. In such cases we will use the notation $T_P$, $S_P$ and $E_P$ to
denote the corresponding time, speedup and efficiency, respectively.
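
As a worked illustration of these measures, consider summing n numbers by fan-in. The figures below rest on our own assumptions: unit-time additions, n a power of two, n - 1 steps on one processor, and log2 n steps on n/2 processors. The function names are ours.

```python
import math


def speedup(t1, tp):
    """Uniprocessor time divided by p-processor time."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Speedup divided by p, the maximum possible speedup on p processors."""
    return speedup(t1, tp) / p


n = 1024
t1 = n - 1                    # serial additions
p = n // 2                    # processors used at the widest tree level
tp = math.ceil(math.log2(n))  # tree levels: 10

print(speedup(t1, tp))                   # 102.3
print(f"{efficiency(t1, tp, p):.3f}")    # 0.200
```

The speedup is large, but the efficiency is roughly 2/log2 n, since most processors sit idle after the first level of the tree.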

Time  and  processor  bounds  for  some  computation  A  will  be  expressed 
as $T_P[A]$ and $P[A]$ in the minimum time cases and $T_p[A]$ in the restricted
processor ($p < P$) case. When no ambiguity can result, we will write $T[A]$
or just $T$ in place of $T_P[A]$ and $P$ in place of $P[A]$, for simplicity. We
write $\log x$ to denote $\log_2 x$ and $\lceil x \rceil$ for the ceiling of x.

Arithmetic  Expression  Tree-Height  Reduction 

Now  we  consider  time  and  processor  bounds  for  arithmetic  expression 
evaluation.  We  restrict  our  attention  to  transforming  expressions  using 
associativity, commutativity and distributivity which leads us to speedups
of $O(n/\log n)$ at efficiencies of $O(1/\log n)$. Since this is asymptotic to the
best possible speedup, more complex transformations (e.g., factoring, partial
fraction expansion) seem unnecessary.


Definition  1 


An arithmetic expression is any well-formed string composed of the
four arithmetic operations (+, -, *, /), left and right parentheses, and atoms
which are constants or variables. We denote an arithmetic expression E of n
distinct atoms by E<n>.

If  we  use  one  processor,  then  the  evaluation  of  an  expression 
containing  n  operands  requires  n  -  1  units  of  time.  But  suppose  we  may  use 
as many processors as we wish. Then it is obvious that some expressions
E<n> may be evaluated in $\log_2 n$ units of time as illustrated in Fig. 1. In
fact, we can establish, by a simple fan-in argument, the following lower bound:

Lemma 1        Given any arithmetic expression E<n>,

$T[E<n>] \ge \lceil \log n \rceil$.

On the other hand, it is easy to construct expressions E<n> whose
evaluation appears to require $O(n)$ time units regardless of the number of
processors  available.  Consider  the  evaluation  of  a  polynomial  by  Horner's 
rule  as  in  section  1.  A  strict  sequential  order  is  imposed  by  the 
parentheses  in  Eq.  1  and  more  processors  than  one  are  of  no  use  in  speeding 
up  this  expression's  evaluation. 

However,  we  are  not  restricted  to  dealing  with  arithmetic  expressions 
as  they  are  presented  to  us.  For  example,  the  associative,  commutative,  and 
distributive  laws  of  arithmetic  operations  may  be  used  to  transform  a  given 
expression  into  a  form  which  is  numerically  equivalent  to  the  original  but 
which may be evaluated more quickly. We now consider examples of each of these.

Fig.  2a  shows  the  only  parse  tree  possible  (except  for  isomorphic 
images) for the expression (((a + b) + c) + d). This tree requires three
steps for its evaluation and we refer to this as a tree height of three.
However, by using the associative law for addition we may rearrange the
parentheses and transform this to the expression (a + b) + (c + d) which may
be evaluated as shown in Fig. 2b with a tree height of two. It should be
noted  that  in  both  cases,  three  addition  operations  are  performed. 

Fig. 3a shows a parse tree for the expression a + bc + d; again
we have a tree of height three. In this case the tree is not unique, but
it is obvious that no lower height tree can be found for the expression
by use of associativity. But by use of the commutative law for addition we
obtain the expression a + d + bc and the tree of Fig. 3b, whose height is
just  two.  Again  we  remark  that  both  trees  contain  three  operations. 

Now  consider  the  expression  a(bcd  +  e)  and  the  tree  for  it  given  in 
Fig.  4a.  This  tree  has  height  four  and  contains  four  operations.  By  use  of 
associativity  and  commutativity,  no  lower  height  tree  can  be  found.  But, 
using  the  arithmetic  law  for  the  distribution  of  multiplication  over  addition 
we obtain the expression abcd + ae, which has a tree of minimum height three,
as  shown  in  Fig.  4b.  However,  unlike  the  two  previous  transformations, 
distribution  has  introduced  an  extra  operation;  the  tree  of  Fig.  4b  has  five 
operations  compared  to  the  four  operations  of  the  undistributed  form. 
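
The heights and operation counts quoted for these examples can be checked mechanically. In the sketch below, a tree is either an atom (a string) or a tuple (operator, left, right); this representation, the function names, and the particular tree shapes are our own reading of the expressions, since the figures themselves are not reproduced here.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def height(tree):
    """Number of operation levels on the longest path to an atom."""
    if isinstance(tree, str):
        return 0
    _, left, right = tree
    return 1 + max(height(left), height(right))


def op_count(tree):
    """Total number of operations in the tree."""
    if isinstance(tree, str):
        return 0
    _, left, right = tree
    return 1 + op_count(left) + op_count(right)


def evaluate(tree, env):
    """Evaluate the tree with atom values taken from env."""
    if isinstance(tree, str):
        return env[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, env), evaluate(right, env))


serial_sum = ("+", ("+", ("+", "a", "b"), "c"), "d")       # (((a+b)+c)+d)
paired_sum = ("+", ("+", "a", "b"), ("+", "c", "d"))       # (a+b)+(c+d)
nested = ("*", "a", ("+", ("*", ("*", "b", "c"), "d"), "e"))             # a(bcd+e)
distributed = ("+", ("*", ("*", "a", "b"), ("*", "c", "d")), ("*", "a", "e"))  # abcd+ae

for name, tree in [("serial_sum", serial_sum), ("paired_sum", paired_sum),
                   ("nested", nested), ("distributed", distributed)]:
    print(name, height(tree), op_count(tree))
# serial_sum 3 3
# paired_sum 2 3
# nested 4 4
# distributed 3 5
```

Distribution shortens the tree by one level here while adding one multiplication, which is the trade described in the text.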

Having  seen  a  few  examples  of  arithmetic  expression  tree-height 
reduction,  we  are  naturally  led  to  ask  a  number  of  questions.  For  any 
arithmetic  expression,  how  much  tree-height  reduction  can  be  achieved?  Can 
general  bounds  and  algorithms  for  tree-height  reduction  be  given?  How  many 
processors  are  needed? 

To  answer  these  questions,  we  present  a  brief  survey  of  results 
concerning  the  evaluation  of  arithmetic  expressions.  Details  and  further 
references  may  be  found  in  the  papers  cited.  Assuming  that  only  associativity 
and  commutativity  are  used  to  transform  expressions,  Baer  and  Bovet  [2] 
gave a comprehensive tree-height reduction algorithm based on a number of
earlier papers. Beatty [7] showed the optimality of this method. An
upper bound on the reduced tree height, assuming only associativity and
commutativity are used, given by Kuck and Muraoka [43], is the following.

Theorem  1       Let  E<n|d>  be  any  arithmetic  expression  with  depth  d  of 
parenthesis  nesting.  By  the  use  of  associativity  and  commutativity  only, 
E<n|d>  can  be  transformed  such  that 

$T_P[E<n|d>] \le \lceil \log n \rceil + 2d + 1$
with  r     n 

Note that if the depth of parenthesis nesting, d, is small, then
this bound is quite close to the lower bound of $\lceil \log n \rceil$. The complexity of
this algorithm has been studied in [13], where it is shown that in addition
to the standard parsing time, tree-height reduction can be performed in O(n)
steps. Unfortunately, there are classes of expressions, e.g., Horner's rule
polynomials or continued fractions, for which no speed increase can be achieved
by using only associativity and commutativity.

Muraoka  [54]  studied  the  use  of  distributivity  as  well  as 
associativity  and  commutativity  for  tree-height  reduction  and  developed 
comprehensive tree-height reduction algorithms using all three transformations.
An  algorithm  which  considers  operations  which  take  different  amounts  of  time 
is  presented  by  Kraska  [37]. 

Bounds  using  associativity,  commutativity  and  distributivity  have 
been  given  by  a  number  of  people  [12,42,53].  In  [12]  the  following  theorem 
is  proved. 

Theorem 2       Given any expression E<n>, by the use of associativity,
commutativity and distributivity, E<n> can be transformed such that

$T_P[E<n>] \le \lceil 4 \log n \rceil$

with

$P \le 3n$.

The complexity of the algorithm of [12] has been studied in [13],
where it is shown that tree-height reduction can be done using O(n log n)
steps in addition to normal parsing. Also, if the number of processors is
allowed to grow beyond O(n), the time coefficient of Theorem 2 has been
reduced to 2.88 by Muller and Preparata [53].

A  number  of  other  results  are  available  for  arithmetic  expressions 
of  special  forms  or  for  general  expressions  if  more  information  is  known 
about  them.  In  [42],  expressions  without  division,  continued  fractions, 
general  expressions  with  a  known  number  of  parenthesis  pairs  or  division 
operations,  and  other  such  cases  are  considered.  Polynomials  are  discussed 
in  this  paper  and  earlier  by  Maruyama  in  [51]. 

One  other  case  should  be  mentioned  here.  For  programming  languages 
with  array  operators,  other  compilation  techniques  may  be  of  interest.  For 
example,  [55]  solves  the  problem  of  minimizing  the  time  to  evaluate  the 
product  of  a  sequence  of  conformable  arrays  on  a  parallel  machine.  In  [42] 
it  is  shown  that  any  matrix  expression  including  addition,  subtraction, 
multiplication  and  matrix  inversion  can  be  handled  as  follows.  If  any  of 
these  four  operations  take  one  matrix  operation  time  step,  then  any  matrix 
expression of n arrays can be evaluated in 6 log n matrix operation steps.
The coefficient is the sum of three addition times, two multiplication times
and one inversion time. Matrix addition and multiplication are straightforward,
but the time required to invert a matrix, measured in standard arithmetic
operations, varies depending on the method used (cf. [62]).

Most  arithmetic  expressions  appearing  in  real  programs  have  a 
rather  small  number  of  atoms.  If  the  atoms  are  subscripted,  then  the 
arrays  may  be  quite  large,  but  it  is  usually  advisable  to  evaluate  these 
as  a  tree  of  arrays,  one  array  operation  at  a  time.  If  tree-height 
reduction  techniques  are  used  on  such  expressions,  there  are  two  possibly 
bad consequences. One is the passing from SIMD to MIMD operation as discussed
earlier. The other is that redundant operations are generally
introduced,  making  the  overall  computation  less  efficient.  However,  for 
expressions  outside  loops,  for  unsubscripted  expressions  inside  loops  or 
for  expressions  of  small  arrays  inside  loops,  tree-height  reduction  can  be 
of  value. 

The number of processors required to evaluate such expressions is
usually smaller than the number required for recurrence solving, as we shall see
later. However, there may be cases where tree-height reduction is desirable,
but the number of available processors is very small. The following results
cover this case and are theoretically interesting.

Corollary 1     Given any expression E<n> and p processors for its
evaluation, by the use of associativity, commutativity and distributivity,
E<n> can be transformed such that

$T_p[E<n>] \le 4 \log n + 10(n-1)/p$.

This  is  a  corollary  of  Theorem  2  and  was  proved  by  Brent  [12]. 

This result has been improved by Winograd [69], who shows that if p processors
are available, we can evaluate any E<n> in $5n/(2p) + O(\log^2 n)$ steps. For small p,
this result is an improvement on Corollary 1.

We have given a number of different upper bounds on time and
processors for arithmetic expression evaluation. While the only lower bound
(Lemma 1) is naive, the upper bounds are close enough to it for practical
purposes. It is clear that the theory is quite well developed and improvements
on these results will be quite difficult to obtain. We conclude
that any arithmetic expression E<n> can be evaluated in O(log n) steps at an
efficiency of O(1/log n).

Recurrence Relations

Linear  recurrences  share  with  arithmetic  expressions  a  role  of 
central  importance  in  computer  design  and  use.  But  they  are  somewhat  more 
difficult  to  deal  with.  While  an  expression  specifies  a  static  computational 
scheme  for  a  scalar  result,  a  recurrence  specifies  a  dynamic  procedure  for 
computing  a  scalar  or  an  array  of  results.  Linear  recurrences  are  found 
in  computer  design,  numerical  analysis  and  program  analysis,  so  it  is 
important  to  find  fast,  efficient  ways  to  solve  them. 

Recurrences  arise  in  any  logic  design  problem  which  is  expressed 
as  a  sequential  machine.  Also,  almost  every   practical  program  which  has 
an  iterative  loop  contains  a  recurrence.  While  not  all  recurrences  are 
linear,  the  vast  majority  found  in  practice  are,  and  we  shall  concentrate 
first  on  linear  recurrences. 

We  shall  begin  with  several  examples.  First,  consider  the  problem 
of computing an inner product of vectors $a = (a_1, \ldots, a_n)$ and $b = (b_1, \ldots, b_n)$.
This can be written as a linear recurrence of the form

$x = x + a_i b_i$,    $1 \le i \le n$    (2)

where  x  is  initially  set  to  zero  and  finally  set  to  the  value  of  the  inner 
product  of  a  and  b. 

As  another  example  of  a  linear  recurrence  which  produces  a  scalar 
result, the evaluation of a degree n polynomial $p_n(x)$ in Horner's rule form
can be expressed as

$p = a_i + x p$,    $2 \le i \le n$    (3)

where p is initially set to $a_1$ and finally set to the value of $p_n(x)$.

Techniques to handle both of these recurrences should be
familiar from our discussion of expression evaluation. Note that Eq. 2
can be expanded by substituting the right-hand side into itself (statement
substitution) as follows:

$x = a_1 b_1$

$x = a_1 b_1 + a_2 b_2$

$x = a_1 b_1 + a_2 b_2 + a_3 b_3$


After  n  iterations  we  have  an  expression  which  can  be  mapped  onto  a  tree 
similar to that of Figure 1.

Earlier,  we  have  also  discussed  polynomial  evaluation.  Thus 
by  carrying  out  a  procedure  similar  to  the  above,  we  could  obtain  an  expression 
which  could  be  handled  by  tree-height  reduction.  Thus,  we  would  expect 
that these and similar recurrences could be solved in $T_P = O(\log n)$ time
steps using $P = O(n)$ processors.
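As an illustration of the O(log n) claim for Eq. 2, the sketch below evaluates an inner product by one parallel multiply step followed by pairwise additions, and counts the parallel steps. It is a serial simulation under the assumption that every operation on one level runs simultaneously; the function name `tree_inner_product` is hypothetical.

```python
def tree_inner_product(a, b):
    """Return (value, parallel_steps) for sum(a[i] * b[i]), n >= 1.

    Each list comprehension stands for one step in which all listed
    operations would be executed at the same time on separate processors.
    """
    terms = [x * y for x, y in zip(a, b)]      # one parallel multiply step
    steps = 1
    while len(terms) > 1:
        # one parallel add step: combine neighbours pairwise
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:                      # odd element carried forward
            paired.append(terms[-1])
        terms = paired
        steps += 1
    return terms[0], steps


value, steps = tree_inner_product(list(range(1, 9)), [2] * 8)
print(value, steps)
```

For n = 8 the count is one multiply step plus log 8 = 3 addition steps.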

But  there  are  other,  more  difficult  looking  linear  recurrences. 
For  example,  a  Fibonacci  sequence  can  be  generated  by 

$f_i = f_{i-1} + f_{i-2}$,    $3 \le i \le n$,    (4)

where $f_1 = f_2 = 1$.

As  another  example,  consider  the  addition  of  two  n-bit  binary  numbers 

$a = a_n \ldots a_1$ and $b = b_n \ldots b_1$. The propagation of the carry across the
sum can be described by

$c_i = y_i + x_i \cdot c_{i-1}$,    $1 \le i \le n$    (5)

where $c_0 = 0$, $x_i = a_i + b_i$ and $y_i = a_i \cdot b_i$.
Here we use + to denote logical or and $\cdot$ to denote logical and. This is
an  example  of  a  bit  level  linear  recurrence,  in  contrast  to  our  previous 
examples  whose  arguments  were  assumed  to  be  real  numbers. 
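The carry recurrence of Eq. 5 can be written out directly. This is a minimal serial sketch, assuming bits are stored least significant first; the function name `carries` and the list layout are illustrative assumptions.

```python
def carries(a_bits, b_bits):
    """Carry bits c_1..c_n of Eq. 5; a_bits[i-1] and b_bits[i-1] hold bit i."""
    c_prev = 0                       # c_0 = 0
    out = []
    for a_i, b_i in zip(a_bits, b_bits):
        x_i = a_i | b_i              # logical or
        y_i = a_i & b_i              # logical and
        c_prev = y_i | (x_i & c_prev)
        out.append(c_prev)
    return out


# 7 + 1: a = 0111, b = 0001, written least significant bit first
print(carries([1, 1, 1, 0], [1, 0, 0, 0]))
```

Each $c_i$ depends on $c_{i-1}$, so this loop is inherently serial as written; the methods discussed later aim to remove that serial chain.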

In  both  Eq.  4  and  Eq.  5  we  are  required  to  generate  a  vector 
result  because  of  the  subscripted  left-hand  side.  This  is  in  contrast  to 
the  scalar  results  of  Eqs.  2  and  3.  Because  of  this,  we  can  expect  a  good 
deal  more  difficulty  in  trying  to  obtain  a  fast  efficient  solution  to  these 
recurrences.  With  the  above  as  an  introduction,  we  now  turn  to  a  formalization 
of  the  general  problem.  We  will  then  give  bounds  for  the  solution  of  the 
general  problem  and  several  important  special  cases. 

Definition  2 

An m-th order linear recurrence system of n equations, R<n,m>,
is defined for $m \le n$ by

$x_i = 0$    for $i \le 0$

and

$x_i = c_i + \sum_{j=i-m}^{i-1} a_{ij} x_j$    for $1 \le i \le n$.

If m = n we call the system a general linear recurrence system and denote it
by R<n>.

Note  that  we  can  express  any  linear  recurrence  system  in  matrix 
terms  as 

$x = c + Ax$

where $c = (c_1, \ldots, c_n)^t$, $x = (x_1, \ldots, x_n)^t$
and A is a strictly lower triangular (banded if m < n) matrix with
$a_{ij} = 0$ for $i \le j$ or $i - j > m$. We refer to A as the coefficient matrix,
c as the constant vector and x as the solution vector.

It should be observed that the constant vector and coefficient
matrix generally contain values which can be computed before the recurrence
evaluation begins. Thus, the $x_i$ and $y_i$ values of Eq. 5 would be precomputed
from the $a_i$ and $b_i$. We will assume that the elements of c and A are
precomputed (if necessary) in all cases so that our bounds on recurrence
evaluation can be simply stated, and that m and n are powers of 2.
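To make the matrix form concrete, the sketch below encodes the Fibonacci recurrence of Eq. 4 as an R<n,2> system and solves it by ordinary forward substitution. The encoding (constant vector with a single 1 and unit coefficients on the two subdiagonals) is one possible choice assumed here for illustration; the function names are hypothetical.

```python
def solve_recurrence(A, c):
    """Serial forward substitution for x = c + A x, A strictly lower triangular."""
    n = len(c)
    x = [0] * n
    for i in range(n):
        x[i] = c[i] + sum(A[i][j] * x[j] for j in range(i))
    return x


def fibonacci_system(n):
    """Coefficient matrix and constant vector of an R<n,2> system for Eq. 4."""
    A = [[0] * n for _ in range(n)]
    c = [0] * n
    c[0] = 1                     # x_1 = 1; x_i = 0 for i <= 0 then gives x_2 = 1
    for i in range(1, n):
        A[i][i - 1] = 1          # coefficient of x_{i-1}
        if i >= 2:
            A[i][i - 2] = 1      # coefficient of x_{i-2}
    return A, c


A, c = fibonacci_system(8)
print(solve_recurrence(A, c))
```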

How  can  we  solve  an  R<n>  system  in  a  fast,  efficient  way  using  many 
simultaneous  operations?  The  following  is  a  straightforward  way  which  uses 
O(n) processors to solve the system in O(n) steps.

Column Sweep Algorithm

Given any R<n> system, we initially know the value of $x_1$. On
step 1 we broadcast this value, $c_1$, to all other equations, multiply by $a_{i1}$,
and add the result to $c_i$. Since we now know the value of $x_2$, this leads to
an R<n-1> system which can be treated in exactly the same way. Thus after
n - 1 steps, each of which consists of a broadcast, a multiply and an
add, and each of which generates another $x_i$, we have the solution vector x.
The method requires n - 1 processors on step 1, and fewer thereafter, so
$T_P = 2(n - 1)$ with $P = n - 1$.
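A serial simulation of the column sweep may help fix the idea. The sketch below assumes a dense list-of-lists coefficient matrix and counts one multiply step and one add step per sweep (the broadcast is not counted separately); the name `column_sweep` is illustrative.

```python
def column_sweep(A, c):
    """Solve x = c + A x by column sweeps; returns (x, parallel_steps)."""
    n = len(c)
    x = list(c)                  # x[0] = c[0] is known at the start
    steps = 0
    for j in range(n - 1):
        rows = range(j + 1, n)
        # broadcast x[j]; every remaining equation multiplies at the same time
        products = [A[i][j] * x[j] for i in rows]
        # every remaining equation adds its product at the same time
        for k, i in enumerate(rows):
            x[i] += products[k]
        steps += 2               # one multiply step plus one add step
    return x, steps


A = [[0, 0, 0],
     [2, 0, 0],
     [1, 3, 0]]
print(column_sweep(A, [1, 1, 1]))
```

After sweep j the value of x[j+1] is final, which is why the next sweep can broadcast it.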

What  speedup  and  efficiency  have  we  achieved  by  this  method?  The 
time  required  to  solve  this  system  using  a  single  processor  which  might 
sweep  the  array  by  rows  or  columns  would  be 

$T_1 = 2[1 + 2 + \ldots + (n - 1)] = 2\left[\frac{n(n-1)}{2}\right] = n(n - 1)$.

Hence the above method achieves a speedup of

$S_P = \frac{n(n - 1)}{2(n - 1)} = n/2$

with an efficiency of

$E_P = \frac{S_P}{P} = \frac{n}{2(n - 1)} > \frac{1}{2}$.
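The arithmetic behind these figures is easy to reproduce for a specific n. The following sketch uses exact fractions so the efficiency is not rounded; the function name `column_sweep_metrics` is a hypothetical helper.

```python
from fractions import Fraction


def column_sweep_metrics(n):
    """Return (T_1, T_P, speedup, efficiency) for the column sweep on R<n>."""
    t1 = 2 * sum(range(1, n))        # 2[1 + 2 + ... + (n-1)]
    tp = 2 * (n - 1)                 # n-1 sweeps of multiply plus add
    p = n - 1                        # processors needed on the first sweep
    speedup = Fraction(t1, tp)
    efficiency = speedup / p
    return t1, tp, speedup, efficiency


print(column_sweep_metrics(8))
```

For n = 8 the speedup is n/2 = 4 and the efficiency is 4/7, slightly above one half.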

Thus we can conclude that the Column Sweep Algorithm is a reasonable method
of solving an R<n> system. But how does it perform in the R<n,m> case for
m << n?

It can be seen that the Column Sweep Algorithm will achieve $S_P = O(m)$
for an R<n,m> system. So if m is very small, the method performs poorly,
particularly if we have a large number of processors available. It should
be noted that the m << n case occurs very often in practice. Note that all of
our examples (Eqs. 2-5) had $m \le 2$.

What are our prospects for finding a faster algorithm? First, we
observe that the total number of initial data values in an R<n,m> system is
O(mn). This is the total of the constant vector c and the coefficient matrix
A. Assuming that these numbers all interact in obtaining a solution, a
fan-in argument (cf. Lemma 1) indicates that we need at least O(log mn)
steps to solve an R<n,m> system. Since $m \le n$, O(log mn) = O(log n). The
Column Sweep Algorithm required O(n) steps, so we still have a big gap in time.

Product Form Recurrence Method

The  next  theorem  is  based  on  an  algorithm  for  the  fastest  known 
method  of  evaluating  an  R<n,m>  system.  For  large  m,  the  number  of  processors 
required  is  rather  large,  but  for  small  m,  the  number  of  processors  is  quite 
reasonable. We also give bounds for the case of a small number of processors;
Corollary 4 is particularly important in the case of m < p < P. This theorem's
proof can easily be stated in terms of the product form of the inverse of the
coefficient matrix A [25], [60]. It is also proved in [19] and [22].

Theorem 3       Any R<n,m> can be computed in

$T_P \le (2 + \log m) \log n - \frac{1}{2}(\log^2 m + \log m)$

with

$P \le m^2 n/2 + O(mn)$    for m << n,

$P \le n^3/68 + O(n^2)$    for $m \le n$.

The details of transforming a system to meet this bound are fairly
straightforward [19]. We will give a simple example here as a basis for
some intuition about how the technique of Theorem 3 works. Consider an
R<4,2> system. This method would generate the following expressions for the
evaluation of the $x_i$:

$x_1 = c_1$

$x_2 = (c_2 + a_{21} c_1)$

$x_3 = (c_3 + a_{31} c_1) + a_{32}(c_2 + a_{21} c_1)$

$x_4 = c_4 + (a_{42} + a_{43} a_{32})(c_2 + a_{21} c_1) + a_{43}(c_3 + a_{31} c_1)$.

Note that all of the parenthesized expressions can be computed simultaneously
in two steps (there are just three distinct ones). Then $x_4$, the largest
calculation, can be completed in three more steps for
$T_P = (2 + \log 2)(\log 4) - \frac{1}{2}(\log^2 2 + \log 2) = 5$.
This time bound may be achieved using just three
processors in this case. But as n grows larger, the number of processors
required becomes very large as shown in the tables of [22].
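The four expressions of the R<4,2> example can be checked against the defining recurrence. The sketch below assumes coefficients stored in dictionaries keyed by the 1-based indices used in the text; the names `sequential` and `product_form` and the sample numbers are illustrative only.

```python
def sequential(c, a):
    """x_i = c_i + sum_{j=i-2}^{i-1} a_ij x_j with x_i = 0 for i <= 0."""
    x = {-1: 0, 0: 0}
    for i in range(1, 5):
        x[i] = c[i] + sum(a.get((i, j), 0) * x[j] for j in range(i - 2, i))
    return [x[i] for i in range(1, 5)]


def product_form(c, a):
    """Evaluate the expressions generated for the R<4,2> example."""
    # The three distinct parenthesized expressions are mutually independent.
    u = c[2] + a[2, 1] * c[1]
    v = c[3] + a[3, 1] * c[1]
    w = a[4, 2] + a[4, 3] * a[3, 2]
    # Remaining work, dominated by x_4.
    x1 = c[1]
    x2 = u
    x3 = v + a[3, 2] * u
    x4 = c[4] + w * u + a[4, 3] * v
    return [x1, x2, x3, x4]


c = {1: 1, 2: 2, 3: 3, 4: 4}
a = {(2, 1): 2, (3, 1): 3, (3, 2): 5, (4, 2): 7, (4, 3): 11}
print(sequential(c, a), product_form(c, a))
```

Both routines return the same vector; the second one exposes which subexpressions do not depend on each other.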

In practice we may have a machine with a limited number of processors
p < P, so Theorem 3 cannot be used directly. Several schemes are
available for mapping a computation onto a smaller set of processors and
generally increasing the efficiency of the computation as well. While
the techniques described below may be applied to arithmetic expressions as
derived from Theorems 1 or 2, the expressions found in typical programs
usually do not require enough processors to warrant such reductions [36].

First, we describe a folding scheme which reduces the number of
processors at a much faster rate than the computation time increases. The
P processor computation for R<n,m> resulting from Theorem 3 contains log n
stages, each stage consisting of many independent tree computations of
height (log m + 1) resulting from inner products of two m-vectors. Such a
tree of height t will contain $2^t - 1$ operation nodes and its evaluation
requires $2^{t-1}$ processors. P is the maximum of the total number of processors
used at each stage. It is easy to show that given such a tree its
height increases only one step by halving the number of processors (called
one fold), and after f folds ($f \le t - 2$) are performed the tree height is
$(t + 2^{f+1} - f - 2)$ while the number of processors is reduced to $2^{t-1}/2^f$. If all
trees at the same stage are folded uniformly, then this folding scheme can
provide us $T_p$ as stated below.

Corollary 2        Let R<n,m> and P be as in Theorem 3. Then if
$f \le \log m - 1$ and $p = \lceil P/2^f \rceil$, we have $T_p \le T_P + (2^{f+1} - f - 2) \log n$.
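The folded tree height can be reproduced by a simple count. The sketch below assumes a complete binary tree of height t evaluated level by level, with each level taking as many passes as its operations require on the reduced processor set; this scheduling rule and the name `folded_height` are assumptions made for illustration, not a description of the scheme in the cited work.

```python
def folded_height(t, f):
    """Steps to evaluate a complete binary tree of height t, level by level,
    on 2**(t-1) / 2**f processors (f folds)."""
    p = 2 ** (t - 1) // 2 ** f
    steps = 0
    for level in range(1, t + 1):
        nodes = 2 ** (t - level)         # operations on this level
        steps += -(-nodes // p)          # ceil(nodes / p) passes
    return steps


for f in range(0, 4):
    print(f, folded_height(5, f), 5 + 2 ** (f + 1) - f - 2)
```

Under this assumption the count agrees with $t + 2^{f+1} - f - 2$ for every f up to t - 2, and one fold (f = 1) adds exactly one step.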

Another technique which is useful in mapping any computation onto
a limited number of processors p < P is the sweeping scheme [44]. If the
i-th step of any parallel computation requires $O_i$ operations using P processors,
it can be executed on p processors in $\lceil O_i/p \rceil$ steps. This observation
leads to the following:

Lemma 2 [12]     If a computation C can be completed in $T_P$ with $O_P$ operations
on P processors, then C can be computed in $T_p \le T_P + (O_P - T_P)/p$ for p < P.
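Lemma 2 can be checked numerically for any given profile of operations per step. The sketch below assumes the operations within one step are independent, so a step with $O_i$ operations takes the ceiling of $O_i/p$ passes on p processors; the function names and the sample profile are illustrative.

```python
from math import ceil


def sweep_time(ops_per_step, p):
    """Steps on p processors when step i has ops_per_step[i] independent operations."""
    return sum(ceil(o / p) for o in ops_per_step)


def lemma2_bound(ops_per_step, p):
    """Right-hand side of Lemma 2: T_P + (O_P - T_P) / p."""
    t_P = len(ops_per_step)
    o_P = sum(ops_per_step)
    return t_P + (o_P - t_P) / p


profile = [8, 4, 2, 1]           # e.g. a balanced tree with 8 leaf operations
print(sweep_time(profile, 3), lemma2_bound(profile, 3))
```

The bound holds because each step with at least one operation costs at most $(O_i - 1)/p + 1$ passes; summing over the $T_P$ steps gives the stated expression.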

To apply this technique directly on the algorithm of Theorem 3, the
$O_P$ value can be obtained by the summation of $2 \cdot p(k)$ for k = 2, 4, 8, ..., n,
where p(k) is the number of processors required at each stage [19]. The
result of this technique can be found in [22].

Our third scheme for reducing the number of processors required
for an R<n,m> system is called the cutting scheme. The idea is to cut
the original system into a number of smaller systems and evaluate these
in sequence, using the algorithm underlying Theorem 3 on each such system.
We have used this scheme in [22], [61], and a detailed proof is given in
[25].

Corollary 3        Let R<n,m> and P be as in Theorem 3. Then any R<n,m>
can be computed with 1 < p < P processors in

$T_p \le 2 \lceil m/p \rceil (n-1)$       for $1 < p \le m$,

<  1^  np  3(log2p+271og  p+144)  +  ^y-  for  m  <  p  <  m2  , 

-1 

-  72  np  30°92P+271og  P+144)  for  m  <_  p  <_  m  , 

cL  C      O    "7  ^ 

<_  B^-^(log  m  log  p+21og2  p-p-  log  m-j  log  m+1)    for  m  <  p  <  P, 

where  $(m,n,p)  is  a  small  constant. 

For most practical R<n,m> systems in which m is very small
compared to n, if the number of processors is also very limited, then a new
computational algorithm developed in [19] can be used more efficiently.
This  method  gives  the  following  time  bounds. 

Corollary 4         Let R<n,m> and P be as in Theorem 3. If m < p < P,
then any R<n,m> can be computed in

$T_p \le (2m^2 + 3m) n/p + O(m^2 \log (p/m))$.

In summary, for 1 < p < P the time bound for evaluating a
given R<n,m> system can be determined by choosing the minimum value
obtained from Corollaries 2, 3 and 4.

Constant Coefficient Recurrences

In  numerical  computation,  we  are  frequently  faced  with  linear 
recurrences  having  constant  coefficients,  i.e.,  Toeplitz  form  matrices. 
For example, Eqs. 2, 3 and 4 are such recurrences. Thus, Eq. 2 could be
rewritten as $x_i = 1 \cdot x_{i-1} + a_i b_i$, and similarly for Eq. 3. Intuitively, we
might  expect  to  be  able  to  compute  such  systems  more  efficiently  than  the 
more  general  recurrences  we  have  been  considering. 

Indeed, this is the case, as we shall see below. We formalize
the  problem  with  the  following  definition  which  should  be  contrasted  with 
Definition  2. 

Definition  3 

An m-th order linear recurrence system with constant coefficients
of n equations, R<n,m>, is defined for $m \le n$ by

$x_i = 0$    for $i \le 0$

and

$x_i = c_i + \sum_{j=1}^{m} a_j x_{i-j}$    for $1 \le i \le n$.

If n = m we call the system a general linear recurrence system with constant
coefficients and denote it by R<n>.

The  fastest  known  method  for  solving  an  R<n,m>  system  can  be  summarized 
by  the  following  theorem  of  Chen  [19].  The  proof  follows  the  lines  of 
Theorem  3,  but  avoids  computations  which  are  unnecessary  due  to  the  constant 
coefficients. 


Theorem 4       Any R<n,m> can be computed in

$T_P \le (3 + \log m) \log n - (\log^2 m + \log m + 1)$

with

$P \le mn$    for m << n

and

$P \le n^2/4$    for $m \le n$.

By exercising special care in avoiding redundant computations,
the proof of Theorem 4 can be modified [19] to give us the following.

Corollary 5      Any R<n,1> can be computed in

$T_P \le 2 \log n$

with $P \le n$.
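One way to see that a first-order constant coefficient recurrence admits a 2 log n step solution is recursive doubling. The sketch below is not the algorithm of [19]; it is a standard technique substituted here for illustration, simulated serially, with each round counted as one multiply step and one add step. The name `recursive_doubling` is hypothetical.

```python
def recursive_doubling(c, a):
    """Solve x_i = c_i + a * x_{i-1} with x_0 = 0; returns (x, parallel_steps)."""
    n = len(c)
    x = list(c)
    power = a                    # holds a**shift
    shift = 1
    steps = 0
    while shift < n:
        # every position updates at once from the previous round's values
        x = [x[i] + power * x[i - shift] if i >= shift else x[i]
             for i in range(n)]
        power = power * power    # squared alongside the multiply step
        shift *= 2
        steps += 2               # one multiply step plus one add step
    return x, steps


print(recursive_doubling([1] * 8, 2))
```

After the round with a given shift, each x[i] holds the sum of the nearest 2 * shift terms, so log n rounds suffice. With a = 1 this produces the running sums of the $c_i$, which is the summation part of Eq. 2.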

By  comparing  Theorems  3  and  4,  we  see  that  while  the  time  bounds 
are  about  the  same,  substantial  processor  savings  can  be  made  in  the  constant 
coefficient case. In the case of small m, we have saved a factor of O(m) processors,
while in the general case we have saved a factor of O(n) processors.

To  test  the  quality  of  these  bounds,  we  can  compare  them  with  some 
simple  calculations.  Consider  the  inner  product  of  Eq.  2.  This  can  obviously 
be  handled  using  n  processors  (for  the  n  multiplications)  in  1  +  log  n 
steps.  Since  we  have  been  assuming  the  coefficient  matrix  and  constant 
vector  are  set  up  before  the  recurrence  solution  begins,  the  multiplication 
is  really  outside  our  present  scope,  so  just  n/2  processors  would  be  required 
for  the  summation. 

The  bound  of  Corollary  5  is  thus  high  by  a  factor  of  two  in  each 
of  processor  count  and  time  for  this  trivial  recurrence.  However,  the 
recurrence method produces not only the inner product, but also all of the
"partial inner products" $x_1, x_2, \ldots, x_{n-1}$, as well as $x_n$. Chen [19] has also

given  other  variations  on  the  above  to  handle  these  special  cases  of 
evaluating  only  the  remote  terms  of  recurrences. 

As  a  final  example,  note  that  the  entire  Fibonacci  sequence  of 
Eq.  4  can  be  evaluated  (since  m  =  2)  in 

$T_P \le 4 \log n - 3$

with $P \le 2n$.

3. Program Analysis

In  this  section  we  discuss  techniques  for  the  analysis  of  whole 
programs.  These  techniques  can  be  used  to  compile  programs  for  parallel  or 
pipeline  computers.  They  can  also  be  used  to  specify  machine  organizations 
for  high  speed  computation.  Since  we  are  really  just  studying  the  structure 
of  ordinary  serial  programs,  our  results  have  interpretations  for  ordinary 
virtual  memory  machines  and  structured  programming  as  well. 

Our  discussion  is  centered  on  methods  developed  and  used  by  the 
author  and  his  students.  Several  significant  efforts  in  this  general  area 
were  carried  out  elsewhere  earlier.  We  will  briefly  sketch  two  of  the  most 
important  of  these. 

The  first  major  study  of  whole  programs  was  carried  out  by  Estrin 
and  his  students  at  UCLA  in  the  1960s.  They  studied  graphs  of  programs  in 
an  attempt  to  isolate  independent  tasks  for  parallel  execution.  The  lowest 
level  object  considered  as  a  task  was  the  assignment  statement,  while  other 
tasks  ranged  in  complexity  up  to  whole  subroutines.  A  number  of  papers 
[29, 50, 59] reported algorithms for the analysis of programs and the
results  of  analyzing  a  limited  number  of  real  programs.  This  group  also 
worked  on  various  scheduling  strategies  for  executing  program  graphs. 

Another  effort  was  initiated  by  Bingham,  Fisher  and  Semon  at 
Burroughs  in  1966.  This  group  studied  various  aspects  of  the  structure  of 
programs and developed algorithms for the automatic detection of parallelism.
They also investigated certain aspects of machine design related to this;
particular attention was given to control unit features. Their work was
reported in a series of reports including [10, 11].

Our  own  effort  has  incorporated  graphs  of  programs  as  well  as  the 
two key ideas of the previous section: tree-height reduction and fast
recurrence evaluation. In this way we can carry parallel execution down
to the level of individual operations in assignment statements. This is,
of course, necessary if one wants to execute programs in the fastest possible
way.

In  order  to  formalize  our  discussion  of  program  graphs  and  their 
manipulation,  we  now  present  a  number  of  definitions. 

Definition  4        An  assignment  statement  is  denoted  by  x  =  E,  where  x 
is a scalar or array variable and E is a well-formed arithmetic expression.
A  block  of  assignment  statements  (BAS)  is  a  sequence  of  one  or  more  assignment 
statements  with  no  intervening  statements  of  any  other  kind.  Any  BAS  can  be 
transformed  by  a  process  called  statement  substitution  to  obtain  a  set  of 
expressions  which  can  be  evaluated  simultaneously. 

For  example,  the  BAS 

X  =  BCD  +  E 

Y  =  AX 

Z  =  X  +  FG 
can  be  evaluated  using  one  processor  in  6  steps,  ignoring  memory  activity. 
By  statement  substitution  we  obtain  three  statements  which  can  be  transformed 
by  tree-height  reduction  to  obtain: 

X  =  BCD  +  E;  Y  =  ABCD  +  AE;  Z  =  BCD  +  E  +  FG  . 
Since  the  resulting  expressions  can  be  evaluated  simultaneously  in  three 
steps, we obtain a speedup of 2. By properly arranging the parse trees it
may be seen that just five processors are required. Thus we have efficiency
$E_P = 2/5$. In general, the number of processors required to evaluate a set of
trees in a fixed number of steps may be minimized using an algorithm of Hu [35].
Note that the speedup here results from two effects: the simultaneous
evaluation of independent trees and tree-height reduction by associativity,
commutativity and distributivity.
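The effect of statement substitution on the example BAS can be checked directly. In the sketch below the second function writes each right-hand side in terms of the inputs only, with parentheses chosen so that no expression is deeper than three operation levels; the function names and the particular parenthesization are illustrative assumptions.

```python
def sequential_bas(A, B, C, D, E, F, G):
    """The original block: Y and Z must wait for X."""
    X = B * C * D + E
    Y = A * X
    Z = X + F * G
    return X, Y, Z


def substituted_bas(A, B, C, D, E, F, G):
    """After statement substitution and tree-height reduction:
    three independent expressions, each of height three."""
    X = (B * C) * D + E
    Y = (A * B) * (C * D) + A * E
    Z = (B * C) * D + (E + F * G)
    return X, Y, Z


args = (2, 3, 4, 5, 6, 7, 8)
print(sequential_bas(*args), substituted_bas(*args))
```

The two versions agree on integer inputs; with floating-point data the reassociated forms may differ in rounding.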

Definition 5        An IF statement is denoted by $(C)(S_1, \ldots, S_n)$ where C
is the conditional expression composed of arithmetic and logical operations
and $S_1, \ldots, S_n$ are n different statements which may be assignment statements,
IF statements, or loops such that control will be transferred to one of them
depending on the value of C.

In  many  programs  it  is  possible  to  find  outside  DO  loops,  rather 
large  sets  of  statements  consisting  of  many  IF  and  GOTO  statements  with  some 
interspersed  assignment  statements.  Suppose  we  have  a  method  of  discovering 
sections  of  code  in  which  the  ratio  of  control  (IF,  GOTO)  statements  to 
arithmetic  operations  is  greater  than  some  small  number.  We  call  such  a 
section  of  code  an  IF  block.  Given  an  IF  block,  it  is  straightforward  to 
put  it  in  a  canonical  form  consisting  of: 

Step  1:  A  set  of  assignment  statements,  all  of  which  may  be  executed 
simultaneously. 

Step  2:  A  set  of  Boolean  functions,  all  of  which  may  be  evaluated 
simultaneously. 

Step  3:  A  binary  decision  tree  through  which  one  path  will  be 
followed  for  each  execution  of  the  program.  No  Boolean  function  or  arithmetic 
expression  evaluation  is  included  in  the  tree. 

Step  4:  A  collection  of  blocks  of  assignment  statements,  each  with 
a  single  variable  or  constant  on  the  right-hand  side.  One  such  block  is 
associated  with  each  path  through  the  tree. 

The  details  of  an  algorithm  for  the  discovery  and  transformation  of 
an IF block to this canonical form are given by Davis [28]. Note that the IF
block may be a graph with or without cycles. Such graphs are converted to
trees called IF trees in the cited references.

Definition 6        A loop is denoted by

$L = (I_1 \leftarrow N_1, I_2 \leftarrow N_2, \ldots, I_d \leftarrow N_d)(S_1, S_2, \ldots, S_s)$

or

$L = (I_1, I_2, \ldots, I_d)(S_1, S_2, \ldots, S_s)$

where $I_j$ is a loop index, $N_j$ is an ordered index set, and $S_j$ is a body statement
which may be an assignment statement, an IF statement or another loop. We use
$OUT(S_j)$ and $IN(S_j)$ to denote for $S_j$ the LHS (output) variable name and the
set of RHS (input) variable names, respectively. We will write $S_j(i_1, i_2, \ldots, i_d)$
to refer to $S_j$ during a particular iteration step, i.e., when the index
variables of $S_j$ are assigned the specific values $I_1 = i_1, I_2 = i_2, \ldots, I_d = i_d$.
If $S_i$ is executed before $S_j$, we will write $S_i < S_j$. We say that the relation
< defines the execution order of the statements. If a loop execution leads to
the execution of n statements, we sometimes denote their execution order by
writing $Y_i: x_i = E_i$, $1 \le i \le n$, implying that $Y_i < Y_{i+1}$, $1 \le i \le n - 1$.

Definition 7       Given a loop $L = (I_1 \leftarrow N_1, \ldots, I_d \leftarrow N_d)(S_1, \ldots, S_s)$,
all possible data dependencies between statement pairs $S_i$ and $S_j$ are given by

$OUT(S_i(k_1, \ldots, k_d)) \cap IN(S_j(\ell_1, \ldots, \ell_d)) \ne \emptyset$     for $S_i(k_1, \ldots, k_d) < S_j(\ell_1, \ldots, \ell_d)$.

Whenever this condition is satisfied, we say that $S_j$ is data dependent on $S_i$,
and this is denoted by $S_i \delta S_j$. $\delta$ is a transitive relation. All of the data
dependencies can be represented by a data dependence graph G of s nodes for
$S_i$, $1 \le i \le s$. For each $S_i \delta S_j$ there is an arc from $S_i$ to $S_j$. Statement $S_j$ is
indirectly data dependent on $S_i$, denoted $S_i \Delta S_j$, if there exist statements
$S_{k_1}, \ldots, S_{k_m}$ such that $S_i \delta S_{k_1} \delta \cdots \delta S_{k_m} \delta S_j$. Practical details on determining
if $S_i \delta S_j$ can be found in [67].

Our definition of data dependence is much more delicate than the usual
definitions [9, 30]. These definitions include the condition
$OUT(S_i) \cap IN(S_j) \ne \emptyset$, i.e., they ignore subscripts and only check variable names.
Thus statements like $S_i$: A(I) = A(I+1) + B are said to be data dependent ($S_i \delta S_i$).
However, by Definition 7 we would not say $S_i \delta S_i$ because the values of A(I+1)
are not those from A(I).
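
The contrast between the two tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `out_elem` and `in_elems` and the bound `N = 8` are hypothetical, and the subscript-aware check simply enumerates statement instances rather than using the analytic tests of [67].

```python
# Statement S: A(I) = A(I+1) + B, executed serially for I = 1..N.
N = 8  # hypothetical loop bound

def out_elem(i):
    """Array element written by S on iteration i."""
    return ("A", i)

def in_elems(i):
    """Array elements and scalars read by S on iteration i."""
    return {("A", i + 1), ("B", None)}

# Name-only test: compare variable names and ignore subscripts.
name_only = "A" in {name for name, _ in in_elems(1)}

# Subscript-aware test in the spirit of Definition 7: an earlier instance
# S(k) must write an element that a later instance S(l) reads.
subscript_aware = any(
    out_elem(k) in in_elems(l)
    for k in range(1, N + 1)
    for l in range(k + 1, N + 1)
)

print("name-only dependence:      ", name_only)
print("subscript-aware dependence:", subscript_aware)
```

The name-only test reports a dependence because A appears on both sides, while the enumeration finds no earlier instance whose output is read later.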

In  terms  of  Definitions  6  and  7,  we  can  further  classify  loops  as 
follows. 

Definition 8  We use $D$, for data dependence relation, to denote the set
of loops with at least one $S_i \delta S_j$, $1 \le i, j \le s$. In other words, there is at
least one $E_k$, $1 \le k \le n$, which is a function of $x_{k-m_k}$, for $m_k > 0$. If $L \in D$
and none of its $S_i$ is a nonlinear function of $x_j$, $1 \le j \le s$, we call it a
linear dependence and write $L \in LD$ ($LD \subset D$). The complement of $D$ is denoted by $\bar{D}$,
for non-dependence relation.
Definition 7 can be applied to any $(d-u+1)$, $1 \le u \le d$, innermost
nest of $L$ as it is also a loop. This is described below.

Definition 9  Let $L_u$ be the $(d-u+1)$ innermost nest of $L$, $1 \le u \le d$,
i.e.,

$L = (I_1, I_2, \ldots, I_d)(S_1, S_2, \ldots, S_s)$

$= (I_1, I_2, \ldots, I_{u-1})(I_u, I_{u+1}, \ldots, I_d)(S_1, S_2, \ldots, S_s)$

$= (I_1, I_2, \ldots, I_{u-1})(L_u)$.

Then for fixed values of $I_1, I_2, \ldots, I_{u-1}$, we can obtain all pairs of data
dependence for $L_u$ according to Definition 7 (note that now $k_1 = \ell_1, \ldots,
k_{u-1} = \ell_{u-1}$), which defines graph $G_u$.


Example 1

Given a loop

L:    DO S2 I1 = 1, 10
        DO S2 I2 = 1, 10
          DO S2 I3 = 1, 10
S1:         A(I1,I2,I3) = B(I1-1,I2,I3)*C(I1,I2) + D*E
S2:         B(I1,I2,I3) = A(I1,I2-1,I3)*F(I2,I3)

The corresponding data dependence graphs $G_1$, $G_2$, and $G_3$ are

[Figure: data dependence graphs $G_1$, $G_2$, and $G_3$]
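
One way to see how the graphs $G_u$ of Definition 9 differ is to enumerate statement instances by brute force. The sketch below is not the subscript analysis used by the authors; it assumes the loop of Example 1 has the form S1: A(I1,I2,I3) = B(I1-1,I2,I3)*C(I1,I2) + D*E and S2: B(I1,I2,I3) = A(I1,I2-1,I3)*F(I2,I3), and it shrinks the index range to 1..4 to stay small. The function name `dependence_graph` is hypothetical.

```python
from itertools import product

R = range(1, 5)  # 1..4 instead of 1..10, an assumption to keep the sketch small

# Each function returns (output element, input elements) for one iteration.
def s1(i1, i2, i3):
    return ("A", i1, i2, i3), [("B", i1 - 1, i2, i3), ("C", i1, i2), ("D",), ("E",)]

def s2(i1, i2, i3):
    return ("B", i1, i2, i3), [("A", i1, i2 - 1, i3), ("F", i2, i3)]

STATEMENTS = {"S1": s1, "S2": s2}

def dependence_graph(u):
    """Arcs of G_u: dependences between instances that agree on I_1..I_{u-1}."""
    writer = {}   # element -> (statement name, iteration) of the earlier write
    arcs = set()
    for it in product(R, R, R):            # serial execution order
        for name, stmt in STATEMENTS.items():
            out, ins = stmt(*it)
            for elem in ins:
                if elem in writer:
                    src, src_it = writer[elem]
                    # Outer indices I_1..I_{u-1} are held fixed for G_u.
                    if src_it[:u - 1] == it[:u - 1]:
                        arcs.add((src, name))
            writer[out] = (name, it)
    return arcs

G1, G2, G3 = (dependence_graph(u) for u in (1, 2, 3))
print(sorted(G1), sorted(G2), sorted(G3))
```

Under these assumptions the whole loop has a cycle between S1 and S2, fixing I1 leaves only the arc from S1 to S2, and fixing both I1 and I2 leaves no arcs.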
For the set of data dependent loops, we can easily distinguish
two cases: acyclic and cyclic graphs. Formally, we define these as follows:

Definition 10  An acyclic dependence graph is a dependence graph
of $s$ nodes, $S_i$ for $1 \le i \le s$, with no pair $(S_i, S_j)$ such that $S_i \Delta S_j$ and $S_j \Delta S_i$.
A dependence graph which is not acyclic will be called cyclic.

Given a data dependence graph, we wish to partition it into blocks
that contain only one statement or a cyclic dependence graph. Formally, we
define these as follows:

Definition 11  On each dependence graph $G_u$, $1 \le u \le d$, for a given
loop $L$, we define a node partition $\pi_u$ of $\{S_1, S_2, \ldots, S_s\}$ in such a way that
$S_i$ and $S_j$ are in the same subset if and only if $S_i \Delta S_j$ and $S_j \Delta S_i$. On the
partition $\pi_u = \{\pi_{u1}, \pi_{u2}, \ldots\}$ for $1 \le u \le d$, define a partial ordering
relation $\alpha$ in such a way that $\pi_{ui} \alpha \pi_{ui}$ (reflexive), and for $i \ne j$, $\pi_{ui} \alpha \pi_{uj}$
iff there is an arc in $G_u$ from some element of $\pi_{ui}$ to some element of $\pi_{uj}$.
The $\alpha$ relation is also anti-symmetric and transitive. The $\pi_{ui}$ are called
$\pi$-blocks.
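
The partition of Definition 11 groups statements that are mutually reachable in the dependence graph, so it can be sketched with a small reachability computation. The function `pi_blocks` and the four-statement graph below are hypothetical illustrations, and the returned `alpha` set lists only the arcs between distinct blocks (the reflexive pairs are omitted).

```python
def pi_blocks(nodes, arcs):
    """Partition nodes into pi-blocks; return (blocks, alpha arcs between blocks)."""
    # reach[a] = set of nodes reachable from a by one or more arcs.
    reach = {a: {b for (x, b) in arcs if x == a} for a in nodes}
    changed = True
    while changed:
        changed = False
        for a in nodes:
            new = set().union(*(reach[b] for b in reach[a])) - reach[a]
            if new:
                reach[a] |= new
                changed = True
    # Two statements share a block iff each is reachable from the other.
    blocks = []
    for a in nodes:
        block = frozenset([a] + [b for b in nodes if b in reach[a] and a in reach[b]])
        if block not in blocks:
            blocks.append(block)
    block_of = {a: blk for blk in blocks for a in blk}
    alpha = {(block_of[a], block_of[b]) for (a, b) in arcs if block_of[a] != block_of[b]}
    return blocks, alpha

# Hypothetical graph: S2 and S3 depend on each other, S1 feeds them, S4 follows.
nodes = ["S1", "S2", "S3", "S4"]
arcs = {("S1", "S2"), ("S2", "S3"), ("S3", "S2"), ("S3", "S4")}
blocks, alpha = pi_blocks(nodes, arcs)
print(blocks)
print(alpha)
```

Here S2 and S3 form one cyclic block, while S1 and S4 are single-statement blocks ordered before and after it.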
Wave-Front Method

If  there  are  cyclic  dependencies  in  a  DO  loop,  we  may  turn  to  our 
next  method,  the  wave-front  method.  This  is  a  well-known  method  which 
effectively  extracts  array  operations  from  the  loop  and  we  can  then  apply 
the  above  bounds  to  these.  If  the  maximum  speedup  given  by  the  wave-front 
method  is  insufficient,  i.e.,  if  the  available  processors  are  not  all  being 
used,  we  may  turn  to  the  recurrence  method  which  gives  the  fastest  known 
speedup  for  such  problems. 

Example  2 

L2:  DO  10  I  =  1,  N 
DO  10  J  =  1,  N 
10   W(I,J)  =  A(I-1,J)  *  W(I-1,J)  +  B(I,J-1)  *  W(I,J-1) 

For  one  or  more  assignment  statements  containing  cyclic  dependencies, 
the  wave-front  method  yields  moderate  speedups  with  high  efficiency.  The 
idea  of  this  method  can  be  illustrated  by  the  loop  L2  of  Example  2  in  which 
statement  10  has  a  cyclic  dependence  in  that  the  LHS  depends  on  RHS  values 
computed  earlier  in  the  loop.  Note  that  generally,  one  or  more  statements 
may form a cyclic dependence. This method proceeds as follows: if W(1,1)
is computed from boundary values, then we can compute W(2,1) and W(1,2) in terms
of W(1,1) and boundary values. Next we can compute W(3,1), W(2,2) and W(1,3)
and so on, as a wave-front passes through the W array at a 45° angle. Thus
we can compute this loop in $O(N)$ steps instead of the $O(N^2)$ serial steps
required. The wave-front method was first described in detail by Muraoka
in [54], was later used in [44], and was implemented in [46]. The
formalization  below  removes  some  of  the  restrictions  included  in  the  original 
formulation. 
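
As an informal illustration of the 45° wave-front for the loop of Example 2, the following Python sketch computes W both serially and one anti-diagonal at a time. The data values, the boundary convention (index 0 holds boundary values) and the function names are assumptions made for the sketch; on an array machine each front would be a single array operation rather than an inner loop.

```python
N = 5
# Hypothetical integer data; index 0 rows and columns act as boundary values.
A = [[(i + 2 * j) % 3 + 1 for j in range(N + 1)] for i in range(N + 1)]
B = [[(2 * i + j) % 4 + 1 for j in range(N + 1)] for i in range(N + 1)]

def boundary():
    return [[1 if i == 0 or j == 0 else 0 for j in range(N + 1)] for i in range(N + 1)]

def serial():
    W = boundary()
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            W[i][j] = A[i - 1][j] * W[i - 1][j] + B[i][j - 1] * W[i][j - 1]
    return W

def wavefront():
    W = boundary()
    steps = 0
    for d in range(2, 2 * N + 1):          # anti-diagonal i + j = d
        front = [(i, d - i) for i in range(max(1, d - N), min(N, d - 1) + 1)]
        # Elements of one front depend only on earlier fronts, so they
        # are mutually independent and could be computed simultaneously.
        for i, j in front:
            W[i][j] = A[i - 1][j] * W[i - 1][j] + B[i][j - 1] * W[i][j - 1]
        steps += 1
    return W, steps

W_wave, steps = wavefront()
print(W_wave == serial(), steps)
```

The number of fronts is 2N - 1, compared with N*N element updates in the serial loop.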

In  [24]  a  revised  wave-front  algorithm  is  presented.  This  includes 
a method of determining the angle $\alpha$ at which the wave-front passes through
the array. It also includes a method for computing the speedup as a function
of $\alpha$. Note that these ideas can be extended to arrays of higher dimension,
as well. However, the wave-front method is of no value in one-dimensional
arrays, since it degenerates to a serial computation in this case. A
similar thing happens if $\alpha$ is slightly greater than 0° or slightly less
than 90°. In such cases we may treat the cyclic dependence as a linear
recurrence  (assuming  it  is  linear). 

Loop  Speedup  Hierarchy 

With  the  above  fundamentals,  it  is  possible  to  give  some  easy 
bounds on overall loop speedup in terms of the uniprocessor time $T_1$.
We  will  present  a  simple  hierarchy  here  based  on  the  maximum  known  speedups 
for  various  classes  of  programs.  Sharper  bounds  will  be  presented  later 
in  the  paper,  based  on  more  detailed  loop  parameters.  The  hierarchy  of 
this  section  will  provide  good  intuition  for  the  following  sections. 

The simplest loop is $L \in \bar{D}$ which by Definition 8 has no dependence
relation between any pair of statements. Thus, following the notation of
Definition 6, all $x_i = E_i$, $1 \le i \le n$, can be computed in parallel. The
following loop, which performs matrix addition and scalar product has this
property.

      DO S2 I1 = 1, 10, 2
      DO S2 I2 = 1, 10, 1
S1:      G(I1,I2) = A(I1,I2) + B(I1,I2)
S2:      Z(I1,I2) = C(I1,I2) * D(I1,I2)

The total time required by any $L \in \bar{D}$ is, by Theorems 1 and 2, $T_p \le O(\log e)$
where $e$ is the maximum number of atoms in $E_i$, $1 \le i \le n$. Hence, we have for $L \in \bar{D}$

$S_p \ge \frac{T_1}{O(\log e)} = O(T_1)$.

Now, let us study a slightly more complicated loop $L \in LD$ such
as one that performs vector inner-product

      DO 5 I = 1, 10
    5 T = T + A(I)*B(I)

For any $L \in LD$, if we pre-compute simultaneously all subexpressions in $E_i$,
$1 \le i \le n$, which do not depend on any computed value in the loop, i.e., any
$x_i$ for $1 \le i \le n$, then the resultant statements $x_i = E'_i$, $1 \le i \le n$, can be
treated as an R<n,m> system where $m < n$ is the maximum of $m_i$ (see Definition 8)
for all $i$. The total computation time of any $L \in LD$ with $m \ll n$, or $m$ independent
of $n$, is therefore any preprocessing time needed to obtain the coefficients
in R<n,m>, which is $O(\log e)$ time steps by Theorems 1 and 2, plus the time
to solve an R<n,m> system which is stated in Theorem 3. Since $n \le T_1 \le ne$,
we have a speedup for this subset of LD,

$S_p \ge \frac{T_1}{O(\log m \log n) + O(\log e)} = O\left(\frac{T_1}{\log T_1}\right)$.
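
To give some intuition for why a linear recurrence with small m can be solved in a logarithmic number of parallel steps, here is a sketch of recursive doubling for the first-order case $x_i = a_i x_{i-1} + b_i$. This is a standard technique chosen for illustration; it is not claimed to be the algorithm behind Theorem 3, and the function names are hypothetical. The inner-product loop above is the special case $a_i = 1$, $b_i = A(I)*B(I)$.

```python
def solve_recurrence(a, b, x0):
    """Return ([x_1..x_n], rounds) for x_i = a_i*x_{i-1} + b_i by recursive doubling."""
    n = len(a)
    # coef[i] = (p, q) means x_{i+1} = p * x_j + q for some earlier x_j;
    # after the last round every entry is expressed in terms of x_0.
    coef = list(zip(a, b))
    shift, rounds = 1, 0
    while shift < n:
        new = list(coef)
        for i in range(shift, n):          # all i are independent within a round
            p_hi, q_hi = coef[i]
            p_lo, q_lo = coef[i - shift]
            new[i] = (p_hi * p_lo, p_hi * q_lo + q_hi)
        coef = new
        shift *= 2
        rounds += 1
    return [p * x0 + q for p, q in coef], rounds

def serial_recurrence(a, b, x0):
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs

# Inner product of two hypothetical 10-element vectors, written as a recurrence.
A = list(range(1, 11))
B = [2] * 10
sums, rounds = solve_recurrence([1] * 10, [x * y for x, y in zip(A, B)], 0)
print(sums[-1], rounds)
```

Each round doubles the span of the composed coefficients, so n terms need about log2(n) rounds, at the cost of extra multiplications.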

Next, consider the subset of loops which has $m = n$ or $m$ a
function of $n$. For example, given an upper triangular matrix A, to solve
Ax = b by the traditional back-substitution method we may write a loop like

      DO 5 I = 10, 1, -1
      X(I) = B(I)/A(I,I)
      DO 5 J = I + 1, 10, 1
    5 X(I) = X(I) - (A(I,J)/A(I,I))*X(J)

In this example, if we preprocess B(I)/A(I,I) for all I, and A(I,J)/A(I,I)
for all I, J, we obtain an R<n,n> system. Since $m = n$, this is the worst-case
loop of LD. Hence, we can say that the computation time of any $L \in LD$ is less
than $O(\log e)$ plus the time stated in Theorem 3, i.e., for any $L \in LD$

$S_p \ge \frac{T_1}{O(\log^2 n) + O(\log e)} = O\left(\frac{T_1}{\log^2 T_1}\right)$.

Finally, we study a simple looking, but more complicated loop:

      DO 5 I = 1, 10
    5 X(I) = (X(I-1) + A/X(I-1))/2

This is a familiar iterative program for approximating $\sqrt{A}$. For this loop,
$L \in D$ but $L \notin LD$. Muraoka [54] shows that by using statement-substitution any
loop with $E_i$ being a d-th degree polynomial of $x_{i-1}$, $d > 1$, can be speeded
up at most by a constant factor. Later, Kung also studied this problem [45]
in  a  similar  way.  However,  since  we  have  been  able  to  linearize  a  number  of 
nonlinear  recurrences,  it  remains  an  open  question  which  techniques  besides 
statement-substitution  may  be  used  to  speed  up  such  loops. 

Summarizing the above, we are able to classify all loops in terms
of their best known speedups over serial computation time $T_1$, i.e.,

$S_p = \frac{T_1}{\alpha_i \log^i T_1}$ for $0 \le i \le 2$     (6)

We call a loop Type i, $0 \le i \le 2$, if its maximum speedup has the form of
Equation 6, or Type 3 if its maximum speedup is of a lower order of magnitude.
This was also discussed in [39].

By the wave-front method we are at best able to achieve $T_p = O(T_1^{1/2})$,
with $T_p = O(T_1)$ in the worst case. Thus we have $S_p \le O(T_1^{1/2})$. Since
the wave-front method's speedup is always inferior to the recurrence method
for such problems, this is consistent with our claim that Equation 6
represents a maximum speedup hierarchy.
Loop Distribution

Now  we  turn  to  the  problem  of  compiling  parallel  operations  from 
serial  loops  which  do  not  contain  IFs  and  GOTOs  (we  will  consider  these 
later).  There  are  two  key  ideas  involved  here:  one  is  the  reduction  of  data 
dependence,  the  other  is  the  distribution  of  loop  control  over  the  loop 
statements.  We  give  our  Loop  Distribution  Algorithm  which  includes  both  of 
them  and  summarizes  our  handling  of  loops  without  IFs. 

When  constructing  a  data  dependence  graph,  we  wish  to  avoid  any 
apparent  dependencies  which  do  not  really  hold  for  the  given  subscripts  and 
index sets. This problem was first studied in a general way by Bernstein [9]
for  unsubscripted  variables.  A  powerful  test  for  subscripted  variables  was 
given  by  Muraoka  [54].  This  has  been  refined  by  Towle  in  [24]  and  in  [67]. 
By  avoiding  the  inclusion  of  spurious  data  dependencies,  we  may  be  able  to 
execute  more  statements  in  parallel  on  machines  capable  of  executing  multiple 
array  operations  (e.g.,  the  Texas  Instruments'  ASC).  Also,  we  may  be  able 
to  break  cyclic  dependencies,  thereby  reducing  i  in  Eq.  6  and  yielding  higher 
speedup. 

Another  way  to  achieve  statement  independence  is  through  statement- 
substitution.  This  yields  increased  speedup,  sometimes  at  the  cost  of  redundant 
operations.  It  should  be  used  with  discretion,  and  only  in  machines  with  a 
high  degree  of  parallelism.  For  acyclic  graphs,  it  is  easy  to  demonstrate 
that  we  can  perform  statement-substitution  between  any  pair  of  nodes  which 
have a dependence relation. As in a BAS (c.f. Definition 4), we substitute
for each LHS variable of $S_i$ on the RHS of $S_j$, which is the cause of a dependence
relation, the corresponding arithmetic expression on the RHS of $S_i$ with all
subscript expressions properly shifted. By applying statement-substitution,
the dependence relation is removed and a set of independent assignment
statements  results.  Each  of  these  represents  a  vector  assignment  statement, 
all  of  which  can  be  executed  simultaneously.  Theorems  2  and  3  can  be  used 
to  bound  the  time  and  processors. 

In  loops  with  acyclic  graphs,  we  can  thus  reduce  the  graph  for 
the  entire  loop  to  a  set  of  independent  nodes  representing  simultaneously 
executable  array  statements.  However,  in  general,  we  must  deal  with  cyclic 
graphs  containing  several  interdependent  nodes.  Our  loop  distribution 
algorithm  will  be  useful  in  handling  these  cases.  By  loop  distribution  we 
mean  the  distribution  of  the  loop  control  statements  over  individual  or 
collections  of  assignment  statements  contained  in  the  loop.  The  idea  of 
loop  distribution  was  introduced  by  Muraoka  [54],  and  later  was 
implemented in our Fortran program analyzer to measure potential parallelism
in ordinary programs [41], [44].

The purpose of distributing a given type i loop is to obtain a
set of smaller size loops of type j, $0 \le j \le i$, which upon execution give
results equivalent to the original loop. This is essentially to reduce
$\alpha_i$ in Eq. 6 (and hence increase speedup) as much as possible. In fact,
the loop distribution algorithm, which resembles the distribution algorithm
for the reduction of tree height of an arithmetic expression, may introduce
more parallelism into a program loop than that obtained from an
undistributed one. We now give the algorithm to accomplish this distribution
as presented in [24].




Loop Distribution Algorithm

Step 1  Given a loop

$L = (I_1, I_2, \ldots, I_d)(S_1, S_2, \ldots, S_s)$

by analyzing subscript expressions and indexing patterns, construct a
dependence graph $G_u$ (c.f. Definitions 7 and 9) for $1 \le u \le d$.

Step 2  On $G_u$, $1 \le u \le d$, establish a node partition $\pi_u$ as in
Definition 11.

Step 3  On the partition $\pi_u$, $1 \le u \le d$, establish a partial ordering
relation $\alpha$ as in Definition 11.

Step 4  Let the $(d-u+1)$ innermost nest of $L$ be $L_u$, $1 \le u \le d$, i.e.,

$L = (I_1, I_2, \ldots, I_d)(S_1, S_2, \ldots, S_s)$

$= (I_1, I_2, \ldots, I_{u-1})\{(I_u, I_{u+1}, \ldots, I_d)(S_1, S_2, \ldots, S_s)\}$

$= (I_1, I_2, \ldots, I_{u-1})(L_u)$.

Replace $L_u$ according to $\pi_u$ with a set of loops $\{(I)(\pi_{u1}), (I)(\pi_{u2}), \ldots\}$
where $(I) = (I_u, I_{u+1}, \ldots, I_d)$.
The condition of the partial ordering relation $\alpha$ insures that
data are updated before being used. Hence, any execution order of the set
of loops which replaces $L_u$ will be valid as long as this relation is not
violated. Thus, for fixed values of $I_1, I_2, \ldots, I_{u-1}$, if $\pi_{ui} \alpha \pi_{uj}$ then
loop $(I)(\pi_{ui})$ must be evaluated before $(I)(\pi_{uj})$, otherwise they may be
computed in parallel. In general, we can also use statement-substitution to
remove this relation between some or all of the distributed loops. But, by
not allowing statement-substitution we have a somewhat simpler compiler
technique; one which generally requires fewer processors and yields less
speedup.

As  an  example  of  the  use  of  our  loop  distribution,  consider  the 
following  pseudo-FORTRAN  program. 


Example  3 


          DO 10 I = 1, N
S1:          A(I) = B(I) * C(I)
             DO 20 J = 1, N
S2:             D(J) = A(I-3) + E(J-1)
S3:    20       E(J) = D(J-1) + F
             DO 30 K = 1, N
S4:    30       G(K) = H(I-5) + 1
S5:    10    H(I) = SQRT(A(I-2))

Following step 1 of the Loop Distribution Algorithm, we obtain a dependence
graph as shown in Fig. 5. We use brackets to denote loop nesting. For
simplicity and speedup in this program, we only consider the case $u = 1$.
In step 2, we form the partition $\pi_1 = \{\pi_{11}, \pi_{12}, \pi_{13}, \pi_{14}\}$ where
$\pi_{11} = \{S_1\}$, $\pi_{12} = \{S_2, S_3\}$, $\pi_{13} = \{S_4\}$, and $\pi_{14} = \{S_5\}$. These partitions are
partially ordered on step 3 as follows: $\pi_{11} \alpha \pi_{12}$, $\pi_{11} \alpha \pi_{14}$ and $\pi_{14} \alpha \pi_{13}$.
Since we are considering only the case $u = 1$ here, we ignore step 4.

The result of this transformation is shown in Fig. 6. We could use
this graph to compile array operations as follows. First, $S_1$ yields a vector
multiply. Next, we can execute $\pi_{12}$ or $\pi_{14}$. $\pi_{12}$ leads to a linear recurrence
of the form R<N,3> which can be solved by the method of Theorem 3, by combining
the D and E arrays as an unknown vector in which $x_1$ represents D(1), $x_2$
represents E(1), $x_3$ represents D(2), $x_4$ represents E(2), etc. $\pi_{14}$ leads to
the execution of $S_5$ as a vector of square roots. Finally, $S_4$ may be executed
for all I and K simultaneously. Note that this requires the broadcasting of
elements of the H array to all elements in the columns of G.

Here, the time required to execute $\pi_{11}$, $\pi_{13}$, and $\pi_{14}$ is independent
of N using $O(N)$ processors. The overall execution time is dominated by $\pi_{12}$
and is $O(\log N)$, so this is a type 1 loop. The number of processors required
to achieve this time is $O(N)$.

Notice that in this example we avoided statement-substitution. Using
statement-substitution, we would have been able to obtain four $\pi$-blocks, all of
which could be executed at once. This would require the execution of several
different operations at one time, while the technique we used allows all
operations at each step to be identical. Furthermore, very little additional
speedup would be possible by this method since $\pi_{12}$ dominates the time here.
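
The claim that the distributed loops give results equivalent to the original loop can be checked numerically for Example 3. The Python sketch below is only a serial simulation under assumptions: it takes the loop as printed (with one-dimensional D, E and G), uses hypothetical input data, and treats indices below 1 as boundary values. It runs the $\pi$-blocks in an order that respects $\alpha$, namely {S1}, {S2,S3}, {S5}, {S4}.

```python
import math

N = 6
F = 2.0  # hypothetical scalar

def initial_state():
    """Hypothetical input data; indices below 1 act as boundary values."""
    idx = range(-5, N + 1)
    return {name: {i: float(k + i % 3 + 1) for i in idx}
            for k, name in enumerate("ABCDEGH")}

def original():
    s = initial_state()
    A, B, C, D, E, G, H = (s[n] for n in "ABCDEGH")
    for i in range(1, N + 1):
        A[i] = B[i] * C[i]                      # S1
        for j in range(1, N + 1):
            D[j] = A[i - 3] + E[j - 1]          # S2
            E[j] = D[j - 1] + F                 # S3
        for k in range(1, N + 1):
            G[k] = H[i - 5] + 1                 # S4
        H[i] = math.sqrt(A[i - 2])              # S5
    return s

def distributed():
    s = initial_state()
    A, B, C, D, E, G, H = (s[n] for n in "ABCDEGH")
    for i in range(1, N + 1):                   # block {S1}: a vector multiply
        A[i] = B[i] * C[i]
    for i in range(1, N + 1):                   # block {S2, S3}: a recurrence
        for j in range(1, N + 1):
            D[j] = A[i - 3] + E[j - 1]
            E[j] = D[j - 1] + F
    for i in range(1, N + 1):                   # block {S5}: vector square root
        H[i] = math.sqrt(A[i - 2])
    for i in range(1, N + 1):                   # block {S4}: runs after {S5}
        for k in range(1, N + 1):
            G[k] = H[i - 5] + 1
    return s

same = original() == distributed()
print(same)
```

The first, third and fourth distributed loops have no dependence between their own iterations, which is what allows them to be compiled as array operations.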
IFs in Loops

To  this  point  we  have  considered  DO  loops  without  conditional 
statements.  The  addition  to  DO  loops  of  IF  and  GOTO  as  well  as  computed 
GOTO  statements,  can  cause  major  problems.  In  particular,  data  dependencies 
can be changed at execution time by the existence of such conditional statements.
Thus,  knowledge  at  compile  time  of  what  can  be  executed  in  parallel  may  be 
difficult  to  obtain.  In  the  worst  case,  we  may  be  forced  by  not  knowing  about 
control  flow,  to  compile  loops  for  serial  execution  which  in  fact  can  be 
executed  in  a  highly  parallel  way. 

In  this  section  we  will  consider  the  addition  to  DO  loops  of  IF 
and  computed  GOTO  statements,  denoted  as  IFs.  The  first  part  of  this  section 
contains  definitions  and  preliminary  results  concerning  IFs.  This  leads 
to  a  distribution  algorithm  which  allows  for  IFs.  The  importance  of  this 
algorithm is that using it we can often execute all of the DO loop simultaneously,
or at worst, localize the part of the DO loop that will be done
sequentially.  Then  we  present  a  summary  of  the  analysis  of  one  year  of  CACM 
Fortran  programs. 

Before  proceeding,  we  need  a  few  definitions  concerning  IFs  and 
control  notions. 

Definition 12  Statement $S_j$ is an immediate control successor of $S_i$,
denoted $S_i \gamma S_j$, if $S_j$ is executed immediately after $S_i$. $S_j$ is not unique
when $S_i$ is an IF. Statement $S_j$ is a control successor of $S_i$, denoted $S_i \Gamma S_j$,
if there exist statements $S_{k_1}, \ldots, S_{k_m}$ such that $S_i \gamma S_{k_1} \gamma S_{k_2} \gamma \cdots \gamma S_{k_m} \gamma S_j$.

Definition 13  Given a loop $L = (I_1 \leftarrow N_1, \ldots, I_d \leftarrow N_d)(S_1, \ldots, S_s)$ with
$S_i = (C)(S_{i_1}, \ldots, S_{i_n})$; the j-th follower, $1 \le j \le n$, of IF statement $S_i$,
denoted by $F_j(S_i)$, is the set $\{S_k \mid S_k = S_{i_j}$ or $S_{i_j} \Gamma S_k\}$. We will refer to an
arbitrary j-th follower as a follower. The set of common followers of IF
statement $S_i$, denoted by $CF(S_i)$, is the set $\{S_k \mid S_k \in \bigcap_{j=1}^{n} F_j(S_i)$ such that
if $S_i$ is executed then $S_k$ is executed$\}$. The set of parent IFs of statement
$S_k$, denoted by $PIF(S_k)$, is the set $\{S_i \mid S_i$ is an IF, $S_i \Gamma S_k$, and there is no
IF statement $S_j$ such that $S_i \Gamma S_j \Gamma S_k$ where $S_k \notin CF(S_j)\}$. When there is more
than one IF in the loop, it is possible that some of the followers will
resemble trees.

Example  4 

      DO S3 I1 = 5, 14, 1
      DO S3 I2 = 5, 14, 1
S1:   If I1 > I2 then (S2:) A(I1,I2) = A(I1-2,I2-2) + A(I1-4,I2-4) + A(I1+2,I2+2)
                                       + A(I1+4,I2+4)
S3:   If I1 < I2 then (S4:) B(I1,I2) = B(I1-1,I2)*C(I1,I2) + B(I1-1,I2)*D(I1,I2)
S5:   If I1 = I2 then (S6:) if MOD(I1,2) = 0 then (S7:) E(I1,I2) = E(I1-1,I2-1)
                                       + F(I1,I2)

In Example 4, we have $F_{11} = \{S_2, S_3, S_4, S_5, S_6, S_7\}$,
$F_{12} = \{S_3, S_4, S_5, S_6, S_7\}$, $F_{31} = \{S_4, S_5, S_6, S_7\}$, $F_{32} = \{S_5, S_6, S_7\}$, $F_{51} = \{S_6, S_7\}$,
$F_{52} = \emptyset$, $F_{61} = \{S_7\}$, $F_{62} = \emptyset$, $CF(S_1) = \{S_3, S_4, S_5, S_6, S_7\}$, $CF(S_3) = \{S_5, S_6, S_7\}$,
$CF(S_5) = \emptyset$, $CF(S_7) = \emptyset$, $PIF(S_2) = PIF(S_3) = S_1$, $PIF(S_4) = PIF(S_5) = S_3$,
$PIF(S_6) = S_5$, and $PIF(S_7) = S_6$.


Definition 14  Given a $\pi$-block $\pi_{ui}$ and IF statement $S_j$ in $\pi_{ui}$, the
k-th control path with respect to $S_j$ through $\pi_{ui}$ is the set
$\{S_\ell \in \pi_{ui} \mid S_\ell \Gamma S_j$ or ($S_\ell \in F_k(S_j)$ and $S_\ell \ne S_j$)$\}$. We will refer to an arbitrary
k-th control path as a control path. Note that the number of control
paths through $\pi_{ui}$ is the same as the number of followers of $S_j$.

Example  5 

      DO S7 I = 1, N
S1:      A(I) = B(I-2) + C(I)
S2:      IF (A(I-1).EQ.0) THEN DO
S3:         B(I) = 3
S4:         D = .FALSE.
         END
         ELSE DO
S5:         E(I) = E(I-3) + 1
S6:         D = .TRUE.
         END
S7:      F(J) = 0

Using Example 5 we will point out the differences between followers
and control paths. $F_1(S_2) = \{S_3, S_4, S_7\}$, $F_2(S_2) = \{S_5, S_6, S_7\}$, and $CF(S_2) = \{S_7\}$.
In the $\pi$-partition $\pi_1 = \{\{S_1, S_2, S_3\}, \{S_4\}, \{S_5\}, \{S_6\}, \{S_7\}\}$ notice that $F_1(S_2)$
is contained in 3 $\pi$-blocks and $F_2(S_2)$ is not in the same $\pi$-blocks as $F_1(S_2)$.
There are two control paths in $\{S_1, S_2, S_3\}$. One control path $\{S_1, S_3\}$ contains a
statement not in any follower of $S_2$ and not all the statements in $F_1(S_2)$.
The other control path $\{S_1\}$ does not contain any statements of $F_2(S_2)$. In
general, a control path can contain statements not in any follower of the IF
and does not have to contain any statements of a follower.

In an IF-free DO loop L, $L \in D$, we know that the data dependencies
for  each  iteration  do  not  change  as  a  function  of  the  input  data.  The 
addition  of  IFs  can  give  rise  to  several  distinct  data  dependencies  for 
different  input  data  sets.  The  data  dependencies  on  different  iterations 
can  be  distinct  for  all  iterations,  for  some  iterations  or  for  only  one 
iteration.  Thus  we  may  have  statements  that  can  have  several  different 
combinations  of  data  dependence. 

Example  6 

      DO S5 I = 1, N
S1:      IF C(I-1) = D(I) THEN (S2:) A(I) = 3*P(I)
         ELSE DO
S3:         A(I) = 4*Q(I)
S4:         B(I) = R(I) + 1
         END
S5:      C(I) = A(I) + B(I-1) + 2

Definition 15  The set of different combinations of data dependence
into statement $S_i$ is called the data dependence combinations of statement
$S_i$ and is denoted by $DDC(S_i)$. The number of followers of $S_j$ that contain $S_i$
is denoted $f_{ji}$.

Statement $S_5$ in Example 6 can be computed in four different
ways. First, the value of A(I) can be from $S_2$ and B(I-1) from outside
the DO loop. Second, the value of A(I) can be from $S_2$ and B(I-1) from $S_4$.
Third, the value of A(I) can be from $S_3$ and B(I-1) from $S_4$. Fourth, the
value of A(I) can be from $S_3$ and B(I-1) from outside the DO loop.

Using  Definitions  8,  12,  13  and  15,  we  can  classify  IFs  into 
three  types: 

Definition 16  Given a loop $L = (I_1 \leftarrow N_1, \ldots, I_d \leftarrow N_d)(S_1, \ldots, S_s)$ and
IF $S_i = (C)(S_{i_1}, \ldots, S_{i_n})$, $1 \le i \le s$, we say $S_i$ is

a)  Type A iff there does not exist $S_j$, $1 \le j \le s$, such that
$S_j \delta S_i$ and $\{I_1, \ldots, I_d\} \cap IN(S_i) = \emptyset$.

b)  Type B iff one of the following holds:

1)  All but one of $F_j(S_i)$ branch out of the loop.

2)  For each $S_k \in \bigcup_{j=1}^{n} F_j(S_i)$ such that S^., $|DDC(S_k)|/f_{ik} \le 1$
and each of the data dependence combinations of $S_k$
only include data dependence on the statements in a
single follower of $S_i$ and/or statements not in any
follower of $S_i$.

c)  Type C iff $S_i$ is not type A or type B.

Type  B  IFs  can  be  further  subdivided.  A  prefix  type  B  IF  is  a  type  B  IF 
that  is  not  data-dependent  on  any  statement  in  its  followers.  Postfix  type 
B  IFs  are  all  other  type  B  IFs. 

Next  we  will  discuss  compiler  algorithms  for  array  machines.  These 
machines have two characteristics of which we want to take advantage. First,
these machines operate on whole arrays. Thus we need to transform programs
to operate on the whole array at once rather than element by element.
Second, these machines can selectively omit certain elements of an array
during  array  operations.  We  define  mode  bits  as  the  indicators  of  which 
array  elements  are  to  be  operated  on.  Thus  we  want  to  transform  IFs  to 
generate  mode  bits.  Both  of  these  characteristics  let  us  obtain  speedups 
over  uniprocessor  machines. 

As  we  saw  earlier  for  IF-free  loops,  by  distributing  DO  loop 
indices over $\pi$-blocks we are able to transform a given loop into one or
more  loops  each  containing  a  vector  or  recurrence  operation.  In  [23]  we 
gave  a  modification  of  that  algorithm  to  allow  IFs.  The  first  goal  of  the 
algorithm  is  to  localize  data  dependence  and  the  effects  of  IFs.  Second, 
the  algorithm  handles  the  IFs  in  four  different  ways. 




1)  Type  A  IFs  are  the  easiest  to  handle.  One  loop  is  compiled  for 
each  follower  and  an  IF  is  used  at  execution  time  to  select  which  loop  to 
execute. 

2)  Prefix  type  B  IFs  use  mode  bits  to  "prefix"  the  body  statements. 
The  IF  is  used  only  to  set  up  the  mode  bits.  The  mode  bits  are  set  up  once 
and  then  used  by  the  body  statements  as  necessary. 

3)  Postfix  type  B  IFs  require  execution  of  each  control  path  for 
the  full  DO  loop  index  set.  Then  we  postfix  by  merging  the  outputs  from  each 
control  path. 

4)  Type C IFs are executed serially.

Since the type A IF depends on variables not set inside the
loop, such IFs can be removed from loops trivially. However, good programmers
seldom write such statements, so this is a moot point.

In the prefix type B IF, we set up the $\pi$-block containing the IF and
the other $\pi$-blocks that are $\alpha$-dependent on it to be executed for the full index
set. However, before these $\pi$-blocks are executed we precompute (at compile or
run  time)  which  follower  will  be  taken  on  each  particular  iteration  of  the  DO 
loop.  This  is  expressed  as  a  vector  of  mode  bits  for  each  follower.  A  vector 
of  mode  bits  is  a  mask  that  is  applied  to  the  index  set  for  a  given  operation. 
In  this  way  we  can  selectively  omit  certain  elements  at  run  time.  The  end 
result  is  that  each  statement  is  executed  only  for  the  proper  elements.  By 
precomputing  these  mode  bits,  we  are  able  to  fix  the  results  a  priori,  hence  the 
name  prefix  type  B. 
As an example of a prefix type B IF, consider the following program.

      DO 5 I = 1, N
         IF (I ≤ 5) THEN A(I) = B(I) + C(I)
         ELSE A(I) = B(I)/C(I)
    5 CONTINUE

Let $M_i = [a,b]$ be a vector of mode bits denoting vector elements from a to b,
inclusive. Then we can compile the above as

      M1 = [1,5]
      M2 = [6,N]
      DO SIM {A(M1) = B(M1) + C(M1), A(M2) = B(M2)/C(M2)}

where DO SIM indicates that the bracketed statements can be executed as array
statements and simultaneously.
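
A rough software analogue of the mode-bit vectors may help here. The Python sketch below takes the masks M1 = [1,5] and M2 = [6,N] from the compiled form above and applies each array statement only where its mask is set. The data values and the helper `masked` are hypothetical, and the two masked operations run one after the other rather than simultaneously.

```python
N = 8
B = [float(i) for i in range(1, N + 1)]       # hypothetical data for B(1..N)
C = [float(i + 1) for i in range(1, N + 1)]   # hypothetical data for C(1..N)

# Mode bits are computed once, before any body statement is executed.
M1 = [1 <= i <= 5 for i in range(1, N + 1)]
M2 = [6 <= i <= N for i in range(1, N + 1)]

def masked(op, mask, old):
    """Apply op elementwise where the mask is set; keep old values elsewhere."""
    return [op(b, c) if m else a for a, b, c, m in zip(old, B, C, mask)]

A = [0.0] * N
A = masked(lambda b, c: b + c, M1, A)   # A(M1) = B(M1) + C(M1)
A = masked(lambda b, c: b / c, M2, A)   # A(M2) = B(M2) / C(M2)

# Serial reference, assuming the IF selects the first branch for I = 1..5.
serial = [B[i] + C[i] if i + 1 <= 5 else B[i] / C[i] for i in range(N)]
print(A == serial)
```

Because the two masks never overlap, the order of the two masked statements does not matter, which is why they may be issued together.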

The postfix type B IFs are more complicated. Using only the statements
in the $\pi$-block that contains the IF, we compute all control paths in parallel
for the full index set. Using these results we evaluate the IF. Finally, the
results are merged to produce the correct results and a set of mode bits is
passed on to other $\pi$-blocks. The details of the test and merge are now given.
What we want to do is find which follower is taken for each iteration. Thus
given the outputs for each control path for each iteration, we want to thread
our way through the outputs, picking up the proper results. It is possible to
do this recursively by using the previously selected results and the results
from the current iteration to determine which follower is to be taken. Using
subscripts to denote control path, in Example 7 we compute $A_1(I) = D(I)$ and
$A_2(I) = B(I)*C(I)$ for all values of the index set. Let $M_1$ and $M_2$ be the vectors
of mode bits for follower 1 and follower 2, respectively. Whenever $M_i(j)$ is a 1,
then the statements in the i-th follower are executed on the j-th iteration
in the serial program.

Example 7

      DO 10 I = 6, N
         IF A(I-3) = 5 THEN A(I) = D(I)
         ELSE A(I) = B(I) * C(I)
   10 CONTINUE

It should be noted, in general, that ANDing of any two distinct
vectors  of  mode  bits  associated  with  an  IF  statement  results  in  a  vector 
of  0  bits.  The  ORing  of  all  the  vectors  results  in  a  vector  of  1  bits 
corresponding  to  the  iterations  on  which  the  IF  statement  is  executed  in 
the  serial  program. 

In  Example  7,  control  path  1  is  taken  whenever  we  were  on  control 
path  1  three  iterations  ago  and  the  output  was  5  or  we  were  on  control  path  2 
three  iterations  ago  and  the  output  was  5.  Control  path  2  is  taken  in  either 
case  if  the  output  was  not  5.  This  can  be  expressed  as  the  following  coupled 
recurrence  relation 

$$M_1(I) \leftarrow (A_1(I-3) = 5 \;\&\; M_1(I-3)) \lor (A_2(I-3) = 5 \;\&\; M_2(I-3))$$

$$M_2(I) \leftarrow (A_1(I-3) \neq 5 \;\&\; M_1(I-3)) \lor (A_2(I-3) \neq 5 \;\&\; M_2(I-3))$$

The techniques to solve this recurrence are described in [21] with
specific reference to this application. This is a bit-level recurrence and
can be done in O(log n) gate delays, where n is the number of iterations.
In reality, we could approximate the log n gate delays by one clock,
i.e., one time step.

Finally, we mask A1 with M1, A2 with M2, and merge the results to get
the proper elements of A. Also M1 and M2 are passed on to other π-blocks as
needed.
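
The three steps for Example 7 (compute both control paths, solve the mode-bit recurrence, mask and merge) can be sketched in Python as follows. This is a hedged illustration, not the method of [21]: the recurrence is stepped through serially rather than solved in O(log n) gate delays, the function names are hypothetical, arrays are 1-based with slot 0 unused, and elements below the first loop index are treated as if they had been produced by control path 1 (an assumed boundary convention).

```python
def example7_serial(a, b, c, d, first=6):
    """Reference: the serial loop of Example 7 (1-based arrays, slot 0 unused)."""
    a = list(a)
    for i in range(first, len(a)):
        a[i] = d[i] if a[i - 3] == 5 else b[i] * c[i]
    return a


def example7_postfix(a, b, c, d, first=6):
    """Postfix type B evaluation: compute both paths, solve mode bits, merge."""
    n = len(a) - 1
    # Step 1: both control paths for the full index set (array operations).
    a1, a2 = list(a), list(a)
    for i in range(first, n + 1):
        a1[i] = d[i]
        a2[i] = b[i] * c[i]
    # Step 2: the coupled mode-bit recurrence. Assumed boundary convention:
    # elements below `first` keep their initial values, modelled as "path 1".
    m1 = [True] * (n + 1)
    m2 = [False] * (n + 1)
    for i in range(first, n + 1):
        m1[i] = (a1[i - 3] == 5 and m1[i - 3]) or (a2[i - 3] == 5 and m2[i - 3])
        m2[i] = (a1[i - 3] != 5 and m1[i - 3]) or (a2[i - 3] != 5 and m2[i - 3])
    # Step 3: mask each path with its mode bits and merge.
    merged = [a1[i] if m1[i] else a2[i] for i in range(n + 1)]
    return merged, m1, m2
```

Step 1 performs redundant work, since only one of A1(I) and A2(I) survives the merge for each I; this is the source of the lower efficiency of postfix type B loops.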

We  should  point  out  that  postfix  type  B  loops  yield  the  same  speedups 
as  prefix  type  B  loops,  in  general.  The  difference  is  that  since  more  processors 
are  required  for  redundant  operations  here,  postfix  type  B  loops  generally 
have  lower  efficiencies. 

As  a  test  of  the  usefulness  of  our  methods,  we  have  analyzed  the 
16  Fortran  programs  which  appeared  in  1973  in  the  CACM  Algorithms  Section. 
Nested  DOs  were  counted  as  one  loop  at  the  outermost  level.  There  were  a 
total of 124 such DO loops. Each loop was characterized in terms of the
worst recurrence type (cf. Eq. 6) and worst IF type it contained. Table 1
is  a  summary  of  our  results. 

We  observe  that  type  A  and  prefix  type  B  IFs  together  with  recurrence 
types  0  and  1  can  be  handled  with  good  speedup  and  efficiency.  This  accounts 
for  85%  of  the  loops.  Four  programs  (type  3  and  type  C)  are  disasters  for 
our  methods  and  in  part  must  be  handled  serially.  The  remaining  programs 
can  be  handled  by  the  postfix  and  wave-front  (type  2)  methods.  Overall,  this 
seems to imply that the methods given would be very effective for the general
mix  of  CACM  Fortran  algorithms. 

4. Machine Considerations

Since  the  early  1960s,  we  have  seen  a  sequence  of  high  speed  machines 
which have some kind of multioperation capability. The CDC 6600 of the 60s
was  succeeded  by  the  7600  in  the  early  70s.  IBM  introduced  the  360/91  and 
its  successors.  The  7600  and  /91  are  both  pipelined  machines  and  achieve 
high  performance  by  operating  on  arrays  of  data.  Their  instruction  sets  are 
rather traditional, however. In contrast, the pipelined Control Data STAR
[33] and Texas Instruments' ASC [68] both have vector instruction sets,
which  should  make  compilation  for  them  substantially  easier. 

On the other hand, the Burroughs' ILLIAC IV [4] is a parallel
array machine, but its instruction set is also traditional in nature.
Vectors must be broken up into partitions of size 64 and loops performed
over a sequence of such partitions. The Goodyear Aerospace STARAN IV [6]
is  a  parallel  array  of  processors,  each  of  which  operates  in  a  bit-serial 
fashion.  This  is  an  example  of  an  associative  processor. 

It  is  interesting  to  note  that  the  highest  speed  pipeline 
processors,  the  STAR  and  ASC,  both  resort  to  parallelism  by  providing 
several  parallel  pipelines  to  achieve  their  desired  operating  speeds. 

The  processing  speedups  achieved  by  all  of  these  machines  are 
due  to  parallelism  between  operations  as  well  as  parallelism  between 
memory  and  processor  activities.  We  shall  discuss  memories,  alignment 
networks,  and  control  units  later.  Our  point  here  is  that  in  order  to 
compile  ordinary  serial  languages  for  these  processors,  two  things  are 
desirable:  1)  Powerful  translation  techniques  to  detect  parallelism, 
and  2)  Array-type  machine  languages. 

The  main  contributions  to  program  speedup  discussed  in  section  3 
arise from our loop distribution procedure. This leads to array operations
and recurrences. Both of these are well suited for computation on
machines  which  must  perform  the  same  operation  on  many  data  elements  to 
achieve  high  performance.  Thus,  the  methods  of  section  3  could  serve  as 
compiler  algorithms  for  such  machines. 

Some  time  ago,  we  implemented  a  comprehensive  analyzer  of  Fortran 
programs.  It  used  algorithms  like  those  of  sections  2  and  3,  although 
some  of  the  techniques  were  much  more  primitive  than  those  discussed  here. 
Details of our algorithms and results may be found in [39], [41], [44] and
[54]. We will summarize a few points very briefly.

Altogether  some  140  ordinary  Fortran  programs,  gathered  from  many 
sources,  were  analyzed.  The  programs  ranged  from  numerical  computations  on 
two-dimensional  arrays  (e.g.,  EISPACK)  to  essentially  nonnumerical  programs 
(e.g.,  Fortran  equivalents  of  GPSS  blocks).  We  set  all  loops  to  10  or  fewer 
iterations and analyzed all paths through the programs, computing T1, Tp,
Sp, Ep, etc. These were averaged over all traces and also over collections
of programs. A plot of our results for Sp vs. p is shown in Fig. 7. Some
of  the  points  are  labelled  with  the  name  of  a  collection  of  programs. 

Our experiments lead us to conclude that multioperation machines
could be quite effective in most ordinary FORTRAN computations. Fig. 7
shows that even the simplest sets of programs (GPSS, for example, has
almost no DO loops) could be effectively executed using 16 processors.
The overall average (ALL in Fig. 7) is 35 processors when all DO loop
limits are set to 10 or less. As the programs become more complex, 128 or
more processors would be effective in executing our programs. Note that
for all of our studies, T1 ≤ 10,000 so most of the programs would be classed
as short jobs in a typical computer center. In all cases, the average
efficiency for each group of programs was no less than 30%. While we have
not analyzed any decks with more than 100 cards, we would expect
extrapolations of our results to hold. In fact, we obtained some decks by
breaking larger ones at convenient points.

These numbers should be contrasted with current computer organizations.
Presently, two to four simultaneous operation general purpose
machines  are  quite  common.  The  pipeline,  parallel  and  associative  machines 
mentioned  above  perform  8  to  64  simultaneous  operations,  but  these  are 
largely  intended  for  special  purpose  use.  Thus,  we  feel  that  our  numbers 
indicate  the  possibility  of  perhaps  an  order  of  magnitude  speedup  increase 
over  the  current  situation. 

Furthermore,  we  took  a  subset  of  the  programs,  again  a  random 
cross-section,  and  varied  the  DO  loop  limits  from  10  to  40.  The  points 
10,  20,  30,  40  correspond  to  the  results.  We  conclude  that  for  our  sample 
of ordinary Fortran programs, speedup is a linear function of T1 and hence
p. This is quite different from some of the folklore which has arisen about
parallel computation [1], [31], [52] (e.g., S = O(log p)). We will
next give an outline of two commonly held beliefs about machine organization.

Let us assume that for 0 ≤ βk ≤ 1, (1-βk) of the serial execution
time of a given program uses p processors, while βk of it must be performed
on k < p processors. Then we may write

$$T_p = \frac{\beta_k T_1}{k} + (1-\beta_k)\frac{T_1}{p}, \quad\text{and}\quad E_p = \frac{T_1}{p\,T_p} = \frac{1}{1 + \beta_k\left(\frac{p}{k} - 1\right)}.$$


For example, if k = 1, p = 33, and β1 = 1/16, then we have E33 = 1/3. This
means that to achieve E33 = 1/3, 15/16 of T1 must be executed using all 33
processors, while only 1/16 of T1 may use a single processor. While E33 = 1/3
is typical of our results (see Fig. 7), it would be extremely surprising to
learn that 15/16 of T1 could be executed using fully 33 processors. This
kind of observation led Amdahl [1] and others [20] [63] to conclude that
computers capable of executing a large number of simultaneous operations
would not be reasonably efficient, or, to paraphrase them, "Ordinary
programs have too much serial code to be executed efficiently on a multioperation
processor."
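
The arithmetic behind this example is easy to check. The sketch below evaluates the two-level model, in which a fraction βk of T1 runs on k processors and the remainder on p, using exact rational arithmetic. The function names and the sample value T1 = 1056 (chosen only because it divides evenly) are assumptions of this illustration.

```python
from fractions import Fraction


def two_level_time(t1, p, k, beta_k):
    """T_p when a fraction beta_k of T_1 runs on k processors and the rest on p."""
    return beta_k * Fraction(t1, k) + (1 - beta_k) * Fraction(t1, p)


def two_level_efficiency(p, k, beta_k):
    """E_p = T_1 / (p * T_p), written as 1 / (1 + beta_k * (p/k - 1))."""
    return 1 / (1 + beta_k * (Fraction(p, k) - 1))


beta = Fraction(1, 16)
print(two_level_efficiency(33, 1, beta))   # exact efficiency for k = 1, p = 33
print(two_level_time(1056, 33, 1, beta))   # parallel time for the sample T_1
```

With k = 1 the serial fraction is weighted by p - 1 in the denominator, which is why even a small β1 pulls the efficiency down sharply.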

Such arguments have an invalidating flaw, however, in that
they assume k = 1 in the above efficiency expression. Evidently no one
who repeated this argument ever considered the obvious fact that k will
generally assume many integer values in the course of executing most
programs. Thus, the expression for Ep given above must be generalized to
allow all values of k up to some maximum.

The  technique  used  in  our  experiments  for  computing  E  is  such  a 
generalization.  For  some  execution  trace  through  a  program,  at  each  time 
step  i,  some  number  of  processors  k(i)  will  be  required.  If  the  maximum 
number of processors required on any step is p, we compute the efficiency
for any trace as

$$E_p = \frac{\sum_{i=1}^{T_p} k(i)}{p\,T_p},$$

assuming p processors are available.

Apparently no previous attempt to quantify the parameters discussed above
has been successful for a wide class of programs. Besides Kuck et al. [41],
the only other published results are by Baer and Estrin [3], who report on
five programs.
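
A minimal sketch of this trace-based efficiency computation follows. It assumes a trace is given simply as a list whose i-th entry is k(i), the number of processors busy on step i; the function name and the sample traces are hypothetical and are not data from the experiments.

```python
from fractions import Fraction


def trace_efficiency(k):
    """Return (p, T_p, E_p) for a trace where k[i] processors are busy at step i."""
    p = max(k)        # most processors needed on any single step
    t_p = len(k)      # number of parallel time steps
    busy = sum(k)     # processor-steps actually used
    return p, t_p, Fraction(busy, p * t_p)


# A trace in which k takes several values.
print(trace_efficiency([1, 4, 8, 8, 2, 1]))

# The two-level case is one particular trace: 66 steps on one processor
# followed by 30 steps on all 33.
print(trace_efficiency([1] * 66 + [33] * 30))
```

The second call shows that the k = 1 argument is just one trace among many; traces with intermediate values of k can reach the same efficiency with far less work done at full width.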

Another commonly-held opinion, which has been mentioned by Minsky
[52], is that speedup S is proportional to log p. Flynn [31] further
discusses this, assuming that all the operations simultaneously executed are
identical. This may be interpreted to hold 1) over many programs of
different characteristics, 2) for one fixed program with a varying number
of processors, or 3) for one program with varying DO loop limits. That
the above is false under interpretation 1 for our analysis is obvious from
Fig. 7. Similarly, it is false under interpretation 2 as the number of
processors is varied between 1 and some number as plotted in Fig. 7. As p
is increased still further, the speedup and efficiency may be regarded as
constant or the speedup may be increased at a decreasing rate together
with a decreasing efficiency. Eventually, as p becomes arbitrarily large,
the speedup becomes constant and in some region the curve may appear
logarithmic. Under interpretation 3, there are many possibilities:
programs with multiply nested DO loops may have speedups which grow much
faster than linearly, and programs without DO loops of course do not
change at all.

One  of  Flynn's  arguments  in  support  of  such  behavior  concerned 
branching  inside  loops.  For  example,  if  on  each  pass  through  a  loop  we 
branched  down  some  new  path,  not  taken  on  any  other  iteration,  then  indeed 
we might have disastrous results. However, we have not observed such
intense splitting; in fact, most computations which do branch inside a
loop come together in common followers rather quickly. Furthermore,
there are usually relatively few distinct paths through a loop.
The results of Table 1 lend further support to our conclusion
that IFs in loops do not pose a serious practical problem.

Abstractly, it seems of more interest to relate speedup to T1
than to p. Based on our data, we offer the following observation:
for many ordinary FORTRAN programs (with T1 ≤ 10,000), we can find p such
that

1) $T_p = \alpha \log_2 T_1$ for $2 \le \alpha \le 10$,

and 2) $p \le \dfrac{T_1}{.6 \log_2 T_1}$

such that

3) $S_p \ge \dfrac{T_1}{10 \log_2 T_1}$ and $E_p \ge .3$.

The average α value in our experiments was about 9. However,
the median value was less than 4, since there were several very large
values.

In terms of our present theoretical results, it is not hard to
justify such estimates. Consider, for example, Corollary 4 which is
intended for solving R<n,m> systems with small n using a limited number
of processors. Let us approximate the times by

$$T_1 \approx mn \quad\text{and}\quad T_p \approx \frac{2m^2 n}{p} = 2m\,\frac{T_1}{p}$$

since most programs have m = 1 or m = 2. Following the observation, we
choose p ≈ T1/log T1, so

$$T_p \approx 2m \log T_1, \quad S_p \approx \frac{T_1}{2m \log T_1}, \quad\text{and}\quad E_p \approx \frac{1}{2m},$$

all of which agree reasonably with the observation.
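
As a numeric illustration of these estimates, the sketch below plugs p = T1/log T1 into the approximations T1 = mn and Tp = 2m²n/p. The function name, the sample sizes, and the use of base-2 logarithms are assumptions made for this illustration.

```python
import math


def corollary4_estimates(n, m):
    """Evaluate the approximate T_1, p, T_p, S_p and E_p for an R<n,m> system."""
    t1 = m * n                      # serial time, T_1 = m * n
    p = t1 / math.log2(t1)          # processor count chosen as T_1 / log2(T_1)
    tp = 2 * m * m * n / p          # parallel time, equal to 2 * m * log2(T_1)
    sp = t1 / tp                    # speedup
    ep = sp / p                     # efficiency, equal to 1 / (2 * m)
    return {"T1": t1, "p": p, "Tp": tp, "Sp": sp, "Ep": ep}


for m, n in [(1, 4096), (2, 2048)]:
    print(m, n, corollary4_estimates(n, m))
```

In both sample cases Tp is a small multiple of log2 T1, so the multiplier 2m plays the role of α; the efficiency 1/(2m) falls from 1/2 to 1/4 as m goes from 1 to 2.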

We  should  point  out  that  our  previous  analyzer  allowed  each 
processor  to  be  executing  a  different  operation  on  any  time  step.  This 
could  degrade  our  results,  possibly  by  a  factor  of  two. 

However,  the  newly  discovered  algorithms  perform  better  than  the 
ones  we  used  in  our  old  analyzer.  We  are  currently  implementing  a  new 
analyzer  with  which  we  expect  to  obtain  better  results  than  those  summarized 
above. 

For  a  more  complete  discussion  of  theoretical  bounds  on  the  time 
and processors required for executing whole Fortran loops see [19], [24].
These references show how certain details about overall program graphs can
be combined with the results of section 3 to obtain loop bounds. From
these,  whole  program  bounds  can  be  obtained. 

Control  Units 

A well-designed control unit is one which never gets in the way
of the processor(s) and memories. In other words, it operates fast enough
to be able to supply instructions whenever they are needed in the processing
and moving of data. Control units tend to become complex, mainly
in  a  timing  sense,  because  they  may  have  a  number  of  tasks  to  control. 

One  way  to  ease  some  control  unit  difficulties  is  to  use 
parallelism  at  the  control  unit  level.  A  multiprocessor  is  an  example; 
several  complete  control  units  are  used.  This  may  be  rather  expensive, 
so  the  multi-function,  pipeline  and  parallel  processor  machines  use  one 
shared control unit. Such control units often contain a number of
independently operating parts. For example, the first use of pipelining was
in control units [14]. A detailed study of the control unit of any high-speed
computer will reveal a number of simultaneously operating, independent
functions.  While  this  may  allow  the  functions  to  operate  more  slowly,  it 
also  causes  some  synchronization  problems. 

As  we  mentioned  in  our  processor  discussion,  the  level  of  machine 
language is very important in modern, high-speed computers. Vector
instruction sets make compiler writing easier. They also focus control unit
design  on  the  correct  questions,  namely,  to  execute  vector  functions  at 
high  speed. 

Control  units  for  high-speed  computers  must  handle  the  traditional 
functions,  including  instruction  decoding  and  sequencing,  I/O  and  interrupt 
handling,  and  address  mapping  and  memory  indexing.  In  addition,  we  can 
list several new functions. For one, memory indexing becomes somewhat more
complex when whole arrays are being accessed in large parallel memories.
Also,  array  computers  (parallel  or  pipeline)  often  rely  on  the  control  unit 
for  scalar  computations.  Broadcasting  of  scalars  to  an  array  must  also 
be  handled.  Special  control  features  such  as  IF  tree  processing  [27] 
can  be  effectively  handled  in  the  control  unit.  Special  stacks  and  queues 
may  be  required  to  handle  a  number  of  processors  and  programs  in  rapid 
succession.  Indeed,  instruction  level  multiprogramming  may  even  be 
attempted. 

Rather  than  discuss  any  of  these  in  detail,  we  simply  refer  the 
reader  to  a  detailed  study  of  the  several  high-speed  machine  papers 
mentioned earlier.

We conclude this section with the computer organization of Fig. 8.
The  control  unit  can  really  be  regarded  as  four  control  units,  one  for 
each  of  the  four  other  major  subsystems  shown.  The  operation  of  this 
machine  can  be  regarded  as  a  pipeline  from  memory  to  memory.  For  move 
instructions  (memory  to  memory)  the  processors  can  be  bypassed. 

Fig.  8  represents  parallelism  at  a  number  of  levels:  within  the 
control  unit,  processors,  memories  and  data  alignment  networks.  Also,  it 
contains  parallelism  in  the  simultaneity  of  operation  of  each  of  these 
which  forms  a  pipeline.  Note  that  pipelining  can  also  be  used  within 
each  of  the  five  major  subsystems  to  match  bandwidths  between  them. 

The  details  of  accessing  parallel  memories  and  of  aligning  the 
accessed  data  will  be  discussed  in  the  next  section. 


Parallel  Memory  Access 

As  effective  speeds  of  processing  units  have  increased,  memory 
speeds  have  been  forced  to  keep  up.  This  has  partly  been  achieved  by  new 
technologies  (magnetic  cores  to  semiconductors).  But  technology  has  not 
been  enough,  as  evidenced  by  the  fact  that  in  1953,  the  first  core  memory 
operating (in Whirlwind I) had an 8 μs memory cycle time. Today, most
computer designers cannot afford to use memories much faster than one
hundred nanoseconds. Thus, we have achieved an increase of only two orders
of magnitude in memory speed over the past twenty-odd years.

In  the  same  period,  the  fastest  processor  operation  times  have 
advanced  from  a  few  tens  of  microseconds  to  a  few  tens  of  nanoseconds;  or 
three orders of magnitude. Memory system speeds have kept up with processors
only through the use of parallelism at the word level. In the late
1950s, ILLIAC II and the IBM STRETCH introduced the first two-way interleaved
memories. At the present time, high speed computers have on the
order  of  100  parallel  memory  units.  If  a  word  can  be  fetched  from  each  of 
m  memory  units  at  once,  then  the  effective  memory  bandwidth  is  increased 
by  a  factor  of  m. 

Array  Access 

Parallel  memories  are  particularly  important  in  array  computers 
(parallel  or  pipeline).  Thus,  if  a  machine  has  m  memory  units  we  can  store 
one-dimensional  arrays  across  the  units  as  shown  in  Fig.  9,  for  m  =  4. 
While  the  first  m  operands  are  being  processed,  we  can  fetch  m  more,  and 
so  on.  But,  if  the  array  is  indexed  such  that,  say  only  the  odd  elements 
are  to  be  fetched,  then  the  effective  bandwidth  is  cut  in  half  due  to  access 
conflicts  as  shown  in  the  underlined  elements  of  Fig.  9.  These  conflicts 
can be avoided by choosing m to be a prime number. Then, any index distance
relatively  prime  to  m  can  be  accessed  without  conflicts. 
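
The effect of the index distance on conflicts can be counted directly. The sketch below assumes that element a of a one-dimensional array is stored in memory unit a mod m and that every unit can deliver one word per memory cycle; the function name `memory_cycles` is hypothetical.

```python
from collections import Counter


def memory_cycles(m, stride, count):
    """Cycles needed to fetch `count` elements spaced `stride` apart from m units.

    Element a is assumed to live in unit a mod m.
    """
    units = [(j * stride) % m for j in range(count)]
    # The busiest unit determines how many times the memories must be cycled.
    return max(Counter(units).values())


print(memory_cycles(4, 1, 4))   # consecutive elements, m = 4
print(memory_cycles(4, 2, 4))   # every other element, m = 4
print(memory_cycles(5, 2, 5))   # every other element, prime m = 5
print(memory_cycles(5, 5, 5))   # index distance equal to m
```

The last call shows the remaining bad case for a prime m: an index distance that is a multiple of m sends every request to the same unit.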

Many  programs  contain  multidimensional  arrays.  These  can  lead 
to  more  difficult  memory  access  problems,  since  we  may  want  to  access  rows, 
columns, diagonals, back diagonals (as in the wave-front method of section
3), square blocks, and so on. For simplicity, consider two-dimensional
arrays  and  assume  we  want  to  access  n  element  partitions  of  arrays  from 
parallel  memories  with  m  units. 

Consider  the  storage  scheme  shown  in  Fig.  10  where  m  =  n  =  4. 
Clearly,  using  this  storage  scheme  we  can  access  any  row  or  diagonal  (e.g., 
the  circled  main  diagonal)  without  conflict.  But  all  the  elements  of  a 
column  (e.g.,  the  underlined  first  column)  are  stored  in  the  same  memory 
unit  so  accessing  a  column  would  result  in  memory  conflicts,  i.e.,  we 
would  have  to  cycle  the  memories  n  times  to  get  the  n  elements  of  a  column. 

In  order  to  allow  access  to  row  and  column  n-vectors,  we  can 
skew  the  data  as  shown  in  Fig.  11  [38].  Now,  however,  we  can  no  longer 
access  diagonals  without  conflict.  It  can  be  shown,  in  fact,  that  there  is 
no  way  to  store  an  mxm  matrix  in  m  memories,  when  m  is  even,  so  that 
arbitrary  rows,  columns,  and  diagonals  can  be  fetched  without  conflicts. 
However, as we shall see, by using more than m memories we can have
conflict-free access to any row, column, or diagonal, as well as other
useful m-vectors.
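
The trade-off between the two storage schemes for m = n = 4 can be reproduced with a few lines of Python. The figures are not available here, so the two mappings are assumptions consistent with the description: in the straight scheme element (i, j) is placed in unit j, and in the skewed scheme each row is rotated by one additional unit, giving unit (i + j) mod m. All names are hypothetical.

```python
from collections import Counter


def cycles(units):
    """Memory cycles needed when the listed units are requested together."""
    return max(Counter(units).values())


def straight(i, j, m):
    return j % m            # column j always lands in unit j


def skewed(i, j, m):
    return (i + j) % m      # each row is rotated one unit further


def access_costs(unit_of, m):
    """Cycles to fetch the first row, first column and main diagonal."""
    n = m
    return {
        "row": cycles([unit_of(0, j, m) for j in range(n)]),
        "column": cycles([unit_of(i, 0, m) for i in range(n)]),
        "diagonal": cycles([unit_of(i, i, m) for i in range(n)]),
    }


print(access_costs(straight, 4))
print(access_costs(skewed, 4))
```

Under these assumed mappings the straight scheme needs n cycles for a column, while the skewed scheme removes the column conflicts at the price of a two-cycle main diagonal.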

It is easy to show that if $m = 2^{2k} + 1$, for any integer k, we
have conflict-free access to rows, columns, diagonals, back diagonals, and
square blocks. For an example with k = 1, see Fig. 12. This and other
similar results are discussed in [15]. If m is not a power of two, certain
difficulties arise in indexing the memory. Also, note that the elements
of various partitions are accessed in scrambled order. A crossbar switch
could be used to unscramble the data, but much cheaper schemes can be
devised. The question of unscrambling the accessed elements using a
rather simple network is discussed by Swanson [66].
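
For k = 1 the claim can be checked exhaustively on a 4 x 4 array stored in m = 5 units. The sketch below assumes element (i, j) is placed in unit (2i + j) mod 5; the skew of 2 is a choice made for this illustration and is not taken from Fig. 12 or [15]. Function names are hypothetical.

```python
from collections import Counter


def unit(i, j, m=5, skew=2):
    """Memory unit holding element (i, j); the skew of 2 is an assumed choice."""
    return (skew * i + j) % m


def conflict_free(cells, m=5, skew=2):
    """True if no two of the listed elements share a memory unit."""
    counts = Counter(unit(i, j, m, skew) for i, j in cells)
    return max(counts.values()) == 1


def partitions(n=4):
    """Rows, columns, both diagonals and all 2 x 2 blocks of an n x n array."""
    parts = [[(r, j) for j in range(n)] for r in range(n)]
    parts += [[(i, c) for i in range(n)] for c in range(n)]
    parts.append([(i, i) for i in range(n)])
    parts.append([(i, n - 1 - i) for i in range(n)])
    parts += [[(r + di, c + dj) for di in (0, 1) for dj in (0, 1)]
              for r in range(n - 1) for c in range(n - 1)]
    return parts


print(all(conflict_free(p) for p in partitions()))
print(all(conflict_free(p, m=4, skew=1) for p in partitions()))
```

The second line repeats the check for four units with a skew of 1, where the main diagonal collides, which illustrates why the extra memory unit is needed.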

In  order  to  simplify  indexing  and  unscrambling,  systems  of  the 
form m = 2n were considered by Lawrie in [47] and [48]. He shows that
conflict-free  access  to  a  number  of  partitions  is  possible  using  such  a 
memory.  We  illustrate  this  in  Fig.  13  with  m  =  2n  =  8.  We  will  discuss 
data  alignment  networks  for  unscrambling  the  data  accessed  in  such  a 
memory  later  in  this  section. 

In  order  to  implement  a  skewing  scheme,  we  must  have  a  properly 
designed  parallel  memory  system.  In  particular,  each  of  the  m  memory  units 
must  have  an  independent  indexing  mechanism.  This  allows  us  to  access  a 
different  relative  location  in  each  memory  unit.  It  is  interesting  to 
observe  that  several  presently  existing  high  speed  computers  have  handled 
their  parallel  memories  in  different  ways. 

The  Control  Data  STAR,  for  example,  does  not  allow  independent 
indexing  of  each  memory  unit.  Instead,  it  has  an  instruction  by  which 
arrays  can  be  physically  transposed  in  memory  to  provide  access  to,  say, 
rows  and  columns.  The  transpose  time  is  essentially  wasted  time  and  some 
algorithms  for  these  machines  are  slowed  down  by  as  much  as  a  factor  of 
two  in  this  way. 

ILLIAC  IV  has  independent  index  registers  and  index  adders  on 
each  of  m  =  64  memories.  Since  it  has  64  processors,  access  to  partitions 
of  n  =  64  elements  is  usually  required.  Thus  the  skew  scheme  of  Fig.  11 
is  easily  implemented.  Of  course,  since  m  is  even,  conflict-free  access 
to rows, columns, and diagonals is impossible. But as Fig. 11 shows,
diagonals  may  be  accessed  in  just  two  memory  cycles. 

Parallel  Random  Access 

The  execution  of  Fortran-like  programs  frequently  leads  to  memory 
access  requirements  which  include  one-dimensional  arrays  and  various 
partitions  of  multidimensional  arrays,  as  we  have  been  discussing.  However, 
we  sometimes  face  access  problems  which  have  much  less  regularity. 

For  example,  consider  the  subscripted  subscript  case: 

      DO I = 1, N
         X(I) = A(B(I))

Here,  we  have  no  idea  at  compile  time  about  which  elements  of  A  are  to  be 
fetched,  assuming  that  B  was  computed  earlier  in  the  program.  This  easily 
generalizes  to  multidimensional  arrays.  Frequently,  table  lookup  problems 
are  programmed  in  this  way. 

To deal with this kind of memory access problem is, in general,
to deal with random access to a parallel memory. Note that this is a
problem  which  has  been  given  a  good  deal  of  attention  for  multiprocessor 
systems  using  rather  abstract  models  of  various  kinds. 

There  are  two  key  questions  on  which  the  validity  and  usefulness 
of  these  models  turns.  They  are: 

1)  What  kind  of  data  dependence  is  assumed  in  the  memory  access 
sequence? 

2)  What  kind  of  queueing  mechanism  is  assumed  for  retaining 
unserviced  accesses? 

In  these  terms,  we  briefly  summarize  some  of  the  results. 
Hellerman's  model  [32]  can  most  reasonably  be  interpreted  to  assume  no  data 
dependence  between  successive  memory  accesses  and  to  have  no  provision  to 
queue conflicting addresses. It is also a steady state model, ignoring
control  dependence.  Thus,  it  scans  an  infinite  string  of  addresses, 
blocking  when  it  finds  the  first  duplicate  memory  unit  access  request. 
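
The scanning rule just described (no queueing, block at the first duplicate memory unit) can be sketched as follows. This is an illustration of the rule as stated in the text, not a reproduction of the analysis in [32]; the function names and the sample request stream are hypothetical, and a finite list stands in for the infinite address string.

```python
def hellerman_cycles(requests):
    """Split a stream of memory-unit requests into memory cycles.

    A cycle ends just before the first request to a unit that is already
    busy in that cycle; nothing is queued or reordered.
    """
    cycles, current = [], []
    for unit in requests:
        if unit in current:
            cycles.append(current)
            current = []
        current.append(unit)
    if current:
        cycles.append(current)
    return cycles


def average_bandwidth(requests):
    """Average number of words delivered per memory cycle."""
    return len(requests) / len(hellerman_cycles(requests))


stream = [0, 1, 2, 0, 3, 3, 1, 2, 0]   # unit numbers for m = 4
print(hellerman_cycles(stream))
print(average_bandwidth(stream))
```

Because a blocked request simply starts the next cycle, any repeated unit cuts the current cycle short, which is what limits the bandwidth predicted by models of this kind.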

In various models, Coffman and his co-workers [16, 17, 26]
extended  the  above  to  include  a  type  of  queueing  and  to  separate  data 
accesses  from  instruction  accesses.  These  papers  further  introduced 
address  sequences  which  were  not  necessarily  uniformly  distributed.  These 
models  also  assumed  that  no  data  dependencies  existed  in  the  address 
sequence. 

Ravi [58] introduced a model which was more realistic for multiprocessor
machines. He allows each processor to generate an address and
computes  the  number  of  them  which  can  be  accessed  without  conflict,  in  a 
steady  state  sense.  Effectively  he  assumes  a  sequential  data  dependence 
in  the  addresses  generated  by  each  processor. 

In  [18]  the  above  results  are  extended  in  several  ways.  First, 
it  is  shown  analytically  that  the  model  of  [58]  yields  an  effective  memory 
bandwidth  which  is  linear  in  the  number  of  memory  units.  Several  models 
are  given  with  queues  in  the  processors  and  in  the  memories,  to  show  the 
differing  effects  on  bandwidth  of  such  queues  and  methods  used  for 
managing  the  queues.  Several  types  of  data  dependencies  are  assumed  to 
exist;  some  as  in  the  Ravi  model  and  others  which  include  dependencies 
between  the  processors.  In  all  of  these  models,  we  show  that  the  effective 
bandwidth of m memories can be made to be O(m). The models are useful for
either  multiprocessor  or  parallel  machines. 

Thus, we conclude that for parallel or multiprocessor machines,
the proper use of m parallel memories can lead to effective bandwidths
which are O(m). This is much more encouraging than the $O(m^{1/2})$ which was
derived from earlier, more naive models.

Alignment  Networks 

Finally,  we  consider  the  problem  of  interconnecting  the  processors 
and  memories  we  have  been  examining.  Data  alignment  requirements  depend 
on  the  programs  to  be  run  and  the  machine  organization.  A  simple  way  to 
connect  several  memories  to  a  processor  is  to  use  a  shared  bus.  For  higher 
speed  operation,  multiprocessors  often  use  a  crossbar  switch  which  allows 
each  processor  to  be  connected  to  a  different  memory  simultaneously.  In 
the  ILLIAC  IV  array,  the  i-th  processor  can  pass  data  to  processors  i  ±  1 
and  i  ±  8,  modulo  64.  Here  all  processors  must  route  data  the  same  distance 
in  a  uniform  way. 
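
To see what this restricted routing costs, the sketch below counts the fewest routing steps needed to shift data uniformly by a given distance when each step moves every item by ±1 or ±8, modulo 64. The breadth-first search and the function name are assumptions of this illustration; it models only the step count, not the timing of the actual machine.

```python
from collections import deque


def route_steps(distance, n=64, moves=(1, -1, 8, -8)):
    """Fewest routing steps that shift data by `distance` processors (mod n)."""
    target = distance % n
    steps = {0: 0}
    queue = deque([0])
    while queue:
        pos = queue.popleft()
        if pos == target:
            return steps[pos]
        for mv in moves:
            nxt = (pos + mv) % n
            if nxt not in steps:
                steps[nxt] = steps[pos] + 1
                queue.append(nxt)
    return None  # not reached for the default move set


for d in (1, 5, 8, 32):
    print(d, route_steps(d))
```

A shift of 5, such as an expression like A(I) + A(I+5) calls for, cannot be done in one step and must be composed from several routes.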

None  of  the  above  techniques  is  well  suited  to  a  high  performance 
parallel  computer.  Indeed,  the  alignment  network  should  be  driven  by  an 
independent  control  unit,  to  operate  concurrently  with  the  processor  and 
memory operation. The requirements include more than uniform shifts and at
times even more than permutations. Often broadcasts are needed, including
partial and multiple simultaneous broadcasts, e.g., $n^{1/2}$ numbers, each
broadcast to $n^{1/2}$ processors for matrix multiplication in an n processor
machine [48].

The alignment network should be able to transmit data from memory
to memory and processor to processor as well as back and forth between
memories and processors. The connections it must provide are derived from
two sources. For one, it must be able to handle the indexing patterns
found in existing programs, for example, the uniform shift of 5 necessary in
A(I) + A(I+5). For another, it must be able to scramble and unscramble
the data for memory accesses. For example, to add a row to a column, one
of the partitions must be "unskewed". More details can be found in [39],
[47] and [43].

With a crossbar switch we can perform any one-to-one connection
of inputs to outputs, and with some modification we can also make
one-to-many connections for broadcasting. The switch can be set and data can be
transmitted in O(log n) gate delays. However, the cost of such a switch
is quite high, namely, $O(n^2)$ gates. Thus for large systems, a crossbar
alignment network is out of the question.

Another possibility is the rearrangeable network. It is shown
by Benes [8] that such a switch, with the same connection capabilities as
a crossbar, can be implemented using only $O(n \log n)$ gates. The time
required to transmit data through the network is just $O(\log n)$.
Unfortunately, the best known time to set up the network for transmission is
$O(n \log n)$ [56]. This control time renders the network impractical as an
alignment network, unless all connection patterns could be set up at compile
time.

The Batcher sorting network [5] is another possibility. Not
only can it perform the connections of a crossbar switch, it can also sort
its inputs, if desired. This network has $O(n \log^2 n)$ gates, so it is an
improvement over the crossbar. However, it requires $O(\log^2 n)$ gate
delays for control and data transmission; this is slower than the crossbar,
but faster than the Benes approach when control time is included.

As a final possibility, the Ω network [48], proposed specifically
for this purpose, can be controlled and transmit data in $O(\log n)$ gate delays,
but contains only $O(n \log n)$ gates. Thus it has the speed of a crossbar
with the cost of a Benes network. Its shortcoming is that it cannot
perform arbitrary interconnections. However, as discussed above, we seek
an alignment network which can handle the requirements posed by program
subscripts and memory skewing schemes. Lawrie has examined a number of
such questions and the Ω network satisfies many of them.
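
The orders of growth quoted for the four networks can be tabulated side by side. This is only a rough sketch: constant factors are ignored, so the numbers compare growth rates and are not gate counts of any real design. The function name `network_orders` is hypothetical.

```python
import math

def network_orders(n):
    """Order-of-growth figures (constants dropped) for an n-input network."""
    lg = math.log2(n)
    return {
        # gates, control (setup) time, and data transmission time
        "crossbar": {"gates": n * n,       "setup": lg,      "transmit": lg},
        "Benes":    {"gates": n * lg,      "setup": n * lg,  "transmit": lg},
        "Batcher":  {"gates": n * lg ** 2, "setup": lg ** 2, "transmit": lg ** 2},
        "omega":    {"gates": n * lg,      "setup": lg,      "transmit": lg},
    }

if __name__ == "__main__":
    for n in (64, 1024):
        print(f"n = {n}")
        for name, row in network_orders(n).items():
            print(f"  {name:9s} gates~{row['gates']:>10.0f}  "
                  f"setup~{row['setup']:>8.0f}  transmit~{row['transmit']:>5.0f}")
```

For n = 1024 the crossbar term is about a hundred times the Ω network term in gates, while the Benes setup term is about a thousand times its own transmission term.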

It is interesting to note that the Ω network consists of a
sequence of identical interconnection paths called shuffles, see e.g.,
[48], [57]. We call transmission from left to right a shuffle and from
right to left an unshuffle. It can be shown that the Benes and Batcher
networks, as well as the Ω network, can all be constructed from a series
of shuffle and unshuffle interconnections of 2x2 switching elements. The
switching elements are basically 2x2 crossbars. In the Batcher network,
they have the further capability of comparing their inputs and switching
on this basis. In the Ω network they can also broadcast either of their
inputs to both outputs.
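
A shuffle-based network of this kind can be modeled briefly. The sketch below assumes the usual textbook formulation, in which each of log n stages is a perfect shuffle followed by 2x2 switches that select an output from one bit of the destination address. The function names are hypothetical, and broadcasting is not modeled.

```python
def shuffle(i, n_bits):
    """Perfect shuffle: rotate the n_bits-bit line number left by one."""
    mask = (1 << n_bits) - 1
    return ((i << 1) & mask) | (i >> (n_bits - 1))

def unshuffle(i, n_bits):
    """Inverse of shuffle: rotate the line number right by one."""
    return (i >> 1) | ((i & 1) << (n_bits - 1))

def omega_route(src, dst, n_bits):
    """Lines visited from src to dst, one entry per stage output."""
    pos = src
    path = [pos]
    for stage in range(n_bits):
        pos = shuffle(pos, n_bits)
        bit = (dst >> (n_bits - 1 - stage)) & 1   # destination bits, MSB first
        pos = (pos & ~1) | bit                    # switch picks upper/lower output
        path.append(pos)
    return path

def passes(perm, n_bits):
    """True if all paths of the permutation use distinct lines at every stage."""
    n = 1 << n_bits
    paths = [omega_route(s, perm[s], n_bits) for s in range(n)]
    return all(len({p[k] for p in paths}) == n for k in range(1, n_bits + 1))
```

In this model a uniform shift such as `[(s + 5) % 8 for s in range(8)]` passes without conflict, while the bit-reversal permutation on 8 lines does not, which illustrates the remark that the network cannot perform arbitrary interconnections.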
A Cost-Effectiveness Measure

One  important  motivation  for  studying  the  structure  of  parallel 
algorithms  is  to  determine  computer  architectures  for  effectively  executing 
these  algorithms.  We  are  interested  in  computer  requirements  as  measured 
in  terms  such  as  number  of  processors,  type  of  processors,  number  of 
memories,  and  type  of  interconnections.  We  measure  the  effectiveness  of  an 
architecture  for  some  algorithm  in  terms  of  its  speedup  over  a  single 
processor  and  the  efficiency  of  the  evaluation  of  that  algorithm. 

In section 3, we defined speedup categories as a way of
characterizing parallel algorithms. These were given in terms of $T_1$, rather than,
say, n, the size of some array, in order to compare dissimilar algorithms.
Such  a  hierarchy  is  useful  for  comparing  algorithms,  but  it  may  not  be 
useful  as  a  quality  or  effectiveness  measure  for  machine  computation.  In 
practice,  which  algorithm  one  chooses  depends  on  many  factors,  an 
important  one  being  the  number  of  processors  available. 

The speedup types mentioned above are useful in algorithm
selection if one has a machine which is relatively very large compared to
the sizes of his problems. But categorizing algorithms on the basis of
speedup alone is unsatisfactory in general, since efficiency is important
if unlimited processors are not available. On the other hand, using
efficiency as the sole measure of effectiveness is too conservative, since
a serial computation always has maximum efficiency, i.e., $E_1 = 1$. Thus,
we turn to a measure which includes both effectiveness and cost.

The cost of a computation is clearly related to the number of
processors required and the time they must be used. We will assume the
processor-time product reasonably represents cost, so we define cost as

    $C_p = p T_p$.

Assume that we have two algorithms, $A_1$ and $A_2$, which compute the
same function using $p_1$ and $p_2$ processors, respectively. Thus, we should
decide which algorithm to use on the basis of

    $\max_i \; S_{p_i} / C_{p_i}$,

the ratios of effectiveness to cost.

We can rewrite this measure as

    $\frac{S_p}{C_p} = \frac{S_p}{p T_p} = \frac{E_p}{T_p}$.

Note that since $E_p \le 1$ and $T_p \ge 1$, $E_p / T_p \le 1$; hence, $S_p / C_p \le 1$.

Another way of viewing this measure is to write

    $\frac{E_p}{T_p} = \frac{E_p S_p}{T_1}$.

Note that for a given function, $T_1$ is fixed so for various parallel
algorithms we are attempting to maximize the product of efficiency and
speedup. Since $E_p \le 1$ and $S_p \le T_1$, division by $T_1$ is simply a normalization
of the maximum product to unity.
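
The measure is easy to evaluate numerically. The sketch below uses the definitions in the text (speedup $S_p = T_1 / T_p$, efficiency $E_p = S_p / p$, cost $C_p = p T_p$). The function names and the two example algorithms are hypothetical, and the step counts are idealized (one addition per step, no memory or alignment delays).

```python
def speedup(t1, tp):
    """S_p = T_1 / T_p."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E_p = S_p / p."""
    return speedup(t1, tp) / p

def cost(tp, p):
    """C_p = p * T_p, the processor-time product."""
    return p * tp

def speedup_per_cost(t1, tp, p):
    """S_p / C_p, the cost-effectiveness measure."""
    return speedup(t1, tp) / cost(tp, p)

# Hypothetical example: summing 1024 numbers, serial time T_1 = 1023 additions.
T1 = 1023
# A1: 512 processors, a full binary tree of additions, 10 steps.
a1 = speedup_per_cost(T1, tp=10, p=512)
# A2: 128 processors, each sums 8 numbers serially (7 steps),
#     then a tree over the 128 partial sums (7 steps), 14 steps in all.
a2 = speedup_per_cost(T1, tp=14, p=128)
```

Here A1 has the larger speedup (about 102 against about 73), but A2 has the larger S/C value, so A2 would be chosen under this measure.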

In  Fig.  14  we  present  a  summary  of  S/C  values  for  a  number  of 
parallel  computations.  This  is  discussed  in  detail  in  [40].  We  shall  make 
a  few  summary  remarks  here. 

Generally  speaking,  the  best  parallel  algorithms  in  the  S/C  sense 
are  at  the  top  and  the  worst  are  at  the  bottom.  The  column  at  the  left 
represents  algorithms  running  on  machines  in  a  less  than  maximum  possible 
speedup  way.  The  right  column  contains  the  best  known  speedups,  in  general. 

By  studying  various  algorithm  and  machine  organizations  for  a 
given  problem,  we  could  hope  to  maximize  the  S/C  measure  for  some  set  of 
computations. For example, the problem of multiplying n matrices is shown
in three places, for $p = n^2$, $p = n^3$ and $p = n^4$. Note that S/C is maximum
for  p  =  n  ,  less  than  the  maximum  possible  speedup  for  this  problem.  In 
[40]  several  algorithms  are  compared  in  this  way. 

Finally,  we  point  out  that  for  an  analysis  such  as  the  one  we  have 
presented  here  to  be  practically  meaningful,  some  machine  details  must  also 
be  included.  Analyses  similar  to  the  above  may  be  performed  on  various  parts 
of  a  computer  system.  For  example,  bit  serial  and  bit  parallel  arithmetic 
units  may  be  compared  in  a  similar  way.  Various  interconnection  networks  can 
also  be  studied  in  this  way.  Thus  we  would  be  able  to  make  tradeoffs  in 
parallelism  from  the  level  of  arithmetic  algorithms  (i.e.,  hardware)  up 
through  the  level  of  computational  algorithms  (i.e.,  programs).  Consider 
the  extra  gates  required  for  carry  lookahead  arithmetic  over  bit  serial 
arithmetic. These could be more cost-effectively used in multiple
microprocessors in a machine properly designed for some set of algorithms.
5. Conclusions

Parallelism in machine organization has been observed and
exploited since the time of Babbage [44]. Since the early 1960s, multioperation
machines have existed in implementations like the CDC 6600, CDC STAR,
TI ASC, Burroughs' ILLIAC IV, and so on. Compilers for these machines have
been uniformly difficult to write, mainly for two reasons. First, it has
been difficult to discover operations in ordinary programs that can be
executed simultaneously on such machines. Second, the organizations of
these  machines  have  often  made  it  difficult  to  implement  array  operations 
in  efficient  ways. 

In  this  paper  we  have  discussed  both  of  these  problems.  The 
first  can  be  solved  by  using  several  compilation  techniques  aimed  at 
parallelism  detection  and  exploitation.  This  involves  building  a  dependence 
graph  in  a  careful  way  and  then  being  able  to  compile  trees,  recurrences 
and  IF  statements  for  execution  in  fast  ways.  For  most  ordinary  FORTRAN 
programs,  this  will  probably  lead  to  theoretical  speedups  which  grow  (nearly) 
linearly  in  the  number  of  processors  used  in  a  computation. 

The  second  problem  is  closely  related  to  the  first.  If  one  can 
transform  most  programs  into  array  form,  then  proper  hardware  is  necessary 
to  support  their  fast  execution.  This  means  we  must  have  a  control  unit 
capable  of  executing  array  instructions  in  a  more  or  less  direct  way.  For 
example,  sufficiently  many  registers  are  required  to  contain  all  of  the 
relevant  parameters  of  loops,  their  indexing  and  array  subscripting. 
Furthermore,  we  must  have  a  high  bandwidth  memory  which  provides  conflict- 
free  access  to  arrays  and  various  commonly  needed  partitions.  We  must  also 
be  able  to  align  such  partitions  with  others  for  processing.  Finally, 
sufficiently high processing bandwidth must be provided, most likely by
a  combination  of  parallel  and  pipeline  techniques.  To  complete  the  circle, 
the  control  unit  must  be  sufficiently  pipelined  to  provide  a  continuous 
flow  of  data  in  a  memory  to  memory  (or  register  to  register)  way. 

While  many  aspects  of  the  problem  are  well  understood  now,  no 
machine  exists  which  combines  all  of  the  necessary  features  in  an  adequate 
way. These ideas are generally regarded as techniques for very high-speed
computation.  However,  the  supercomputers  of  one  day  are  often  the  ordinary 
computers  of  a  later  day  and  perhaps  the  minicomputers  of  the  future. 
Clearly,  good  architectural  ideas  for  machine  speedup  and  ease  of  compiler 
writing  will  be  widely  useful.  Indeed,  with  the  arrival  of  sixteen-bit 
microprocessors,  we  seem  to  be  quite  close  to  the  time  when  large  numbers 
of  simple  processors  can  be  used  as  components  in  one  super  processor. 

We  emphasize  the  need  for  program  analysis  to  determine  which 
machine  organizations  are  useful  for  various  classes  of  programs.  In  this 
paper  we  have  concentrated  on  Fortran-like  programs.  The  techniques 
described  here  have  been  adapted  to  other  languages.  For  example,  in  [28], 
all  of  the  important  blocks  of  GPSS  were  analyzed.  Using  these  results, 
a  machine  organization  was  proposed  for  the  fast  execution  of  GPSS  programs. 
Since  little  arithmetic  is  involved,  designs  for  the  memory,  control  unit 
and  alignment  network  were  emphasized. 

Similarly,  in  [65],  a  number  of  COBOL  programs  were  analyzed  using 
these  techniques.  There  the  memory  hierarchy  requirements  became  obvious. 
Also,  the  ability  to  execute  one  program  on  a  number  of  successive, 
independent  data  sets  was  necessary.  Again,  little  arithmetic  was  required. 
The control unit became a key to obtaining high performance; too much
complexity in the control unit was a danger.

We have also attempted to exploit parallel and pipeline
techniques in information retrieval and file processing problems [34] and [64].
Here,  since  there  are  no  standard  programming  languages,  standard 
algorithms  (e.g.,  list-merging)  were  studied. 

It  seems  clear  that  by  combining  the  control  notions  discussed 
in  this  paper  with  various  data  structure  transformations,  other  programming 
languages  could  be  analyzed  and  good  machine  parameters  discovered.  For 
example,  instead  of  arithmetic  tree-height  reduction  and  fast  recurrence 
handling  methods,  various  string  manipulations  or  tree  and  graph  algorithms 
could  be  used--these  could  be  interpreted  in  appropriate  languages,  e.g., 
SNOBOL,  LISP,  etc. 

Another  approach  to  machine  design  is  to  avoid  programs  and  go 
directly  to  the  algorithms  themselves.  We  have  carried  out  the  direct 
analysis  of  several  algorithms  including  [61]  and  [62].  In  general,  better 
speedup  results  can  be  obtained  in  this  way  since  one  avoids  the  difficulties 
of  programming  language  artifacts.  In  fact,  while  few  nonlinear  recurrences 
are  found  in  real  programs,  we  often  are  limited  in  obtaining  faster  numerical 
algorithms  by  nonlinear  recurrences.  By  hand,  the  algorithm  is  reorganized 
into  a  potentially  fast  form  which  contains  a  nonlinear  recurrence  and, 
hence,  cannot  be  handled  by  known  methods. 

While some nonlinear recurrences can be treated analytically and
others can be shown to be intractable, this area is generally not well
understood. The hand analysis of whole algorithms provides good nonlinear
recurrence research problems. Of course, even if we cannot analytically
speed up nonlinear recurrences, this does not mean that there is no hope.
In fact, hardware tricks can be used to get around some nonlinearities.
At the bit level, several examples of this are discussed in detail in
[21], for example, binary multiplication.

To  summarize,  we  have  been  discussing  the  structure  of  programs, 
the  structure  of  machines  and  the  relation  between  the  two.  To  discover 
ultimate  speed  machines,  or  just  to  find  low  cost,  high  performance 
machine designs, such studies are important. By obtaining a better
understanding of the structure of real programs we can determine good compiler
algorithms  for  exotic  machine  organizations.  Furthermore,  there  are 
benefits  for  standard  machines;  for  example,  we  can  hope  to  improve  paging 
performance  for  virtual  memory  machines  by  understanding  the  control  and 
data  flow.  Also,  by  transforming  a  program  into  simpler  forms--for 
example,  removing  IFs  from  loops  and  bringing  out  the  array  nature  of 
programs--we  can  hope  to  aid  programmers  in  understanding  and  debugging 
their  programs.  Finally,  by  viewing  compilation  in  a  broad  way  we  can 
see  that  certain  traditional  speedup  techniques  of  logic  design  are 
identical to methods useful in compilers for multioperation machines.
These  include  tree-height  reduction  and  fast  linear  recurrence  solving. 
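
As a reminder of what tree-height reduction by associativity buys, the sketch below sums a list by pairwise combination and counts the parallel steps. It is a minimal illustration that assumes one addition per processor per step and enough processors for every level; the name `balanced_sum` is hypothetical.

```python
import math

def balanced_sum(values):
    """Sum by pairwise combination; return (total, number of parallel steps)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # All additions within one level are independent of each other.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element passes through unchanged
        level = nxt
        steps += 1
    return level[0], steps
```

A left-to-right evaluation of n operands takes n - 1 steps; the reassociated tree takes ceil(log2 n) steps, e.g. 3 steps instead of 7 for eight operands.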
Figure 2. Tree-Height Reduction by Associativity
Figure 3. Tree-Height Reduction by Commutativity
Figure 4. Tree-Height Reduction by Distributivity
Figure 5. Program Graph
Figure 6. Distributed Graph


[Figure: plot against number of processors; remaining content not recoverable]
Fig. 12. Skewed Storage (m = 5)